<a href="https://github.com/dd-consulting">
     <img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
    One-Stop Analytics: Predictive Modeling (Regression Models)
</h1>

Case Study of Autism Spectrum Disorder (ASD) with R


[ United States ]

Centers for Disease Control and Prevention (CDC) - Autism Spectrum Disorder (ASD)

Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. CDC is committed to continuing to provide essential data on ASD, search for factors that put children at risk for ASD and possible causes, and develop resources that help identify children with ASD as early as possible.

https://www.cdc.gov/ncbddd/autism/data/index.html

[ Singapore ]

TODAY Online - More preschoolers diagnosed with developmental issues

Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.

https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues

https://www.pathlight.org.sg/


Workshop Objective:

Use R to predict Autism Spectrum Disorder (ASD) prevalence.

https://www.cdc.gov/ncbddd/autism/data/index.html

  • Linear Model: Simple Linear Regression (SLR)

  • Linear Model: Multiple Linear Regression (MLR)

  • Linear Model: Polynomial Regression (PLR)

  • Linear Model: Logistic Regression (LR)

  • Linear Model: Model Evaluation: Train/Test, K-Fold Cross Validation, Confusion Matrix

  • Linear Model: Prevent Overfitting by Regularization Methods

  • Workshop Submission

  • Appendices

if(!require(repr)){install.packages("repr")}
## Loading required package: repr
library("repr") # Show graphs in-line notebook

Obtain current R working directory

getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"

Set new R working directory

# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"

Linear Model: Simple Linear Regression (SLR)

<h3>
Linear Model: Simple Linear Regression (SLR) - Workshop Task
</h3>

Workshop Task:

    1. Graph the data in a scatterplot to determine if there is a possible linear relationship.
    2. Compute and interpret the linear correlation coefficient, r.
    3. Determine the regression equation for the data.
    4. Graph the regression equation and the data points.
    5. Identify potential influential observations (outliers).
    6. At the 5% significance level, do the data provide sufficient evidence to conclude that the slope of the population regression line is not 0 and, hence, that [ Year ] is useful as a predictor of ASD [ Prevalence ]?
    7. Obtain the residuals and create a residual plot. Decide whether it is reasonable to consider that the assumptions for regression analysis are met by the variables in question.
    8. Compute and interpret the coefficient of determination, \(R^2\).
    9. Find the predicted ASD Prevalence of a future Year.
    10. Determine a 95% confidence interval for the predicted ASD Prevalence.

Use Case Data: “../dataset/ADV_ASD_State_R.csv”

Read in CSV data, storing as R dataframe

# Read back in above saved file:
ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# Convert Year_Factor to ordered.factor
ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) 
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, 
                                    levels=c("Low", "High"))
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, 
                                    levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
##   State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1    AZ       45322        6.5      5.8      7.3 2000   addm
## 2    GA       43593        6.5      5.8      7.3 2000   addm
## 3    MD       21532        5.5      4.6      6.6 2000   addm
## 4    NJ       29714        9.9      8.9     11.1 2000   addm
## 5    SC       24535        6.3      5.4      7.4 2000   addm
## 6    WV       23065        4.5      3.7      5.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2 Numerator_ASD Numerator_NonASD  Proportion
## 1        AZ-Arizona           295            45027 0.006508980
## 2        GA-Georgia           283            43310 0.006491868
## 3       MD-Maryland           118            21414 0.005480215
## 4     NJ-New Jersey           294            29420 0.009894326
## 5 SC-South Carolina           155            24380 0.006317506
## 6  WV-West Virginia           104            22961 0.004508996
##   Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1                      5.798905                      7.303948             9.7
## 2                      5.769431                      7.302595            11.0
## 3                      4.557351                      6.583638             8.6
## 4                      8.814705                     11.102544            14.8
## 5                      5.381662                      7.411085             9.3
## 6                      3.703408                      5.483723             6.6
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1           8.5          11.1               3.2             2.5             4.0
## 2           9.7          12.4               2.0             1.5             2.7
## 3           7.1          10.6               2.2             1.5             2.7
## 4          13.0          16.8               4.3             3.3             5.5
## 5           7.8          11.2               3.3             2.4             4.5
## 6           5.2           8.2               2.4             1.6             3.5
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                           8.6                         7.5
## 2                           7.9                         6.7
## 3                           4.9                         3.8
## 4                          11.3                         9.5
## 5                           6.5                         5.2
## 6                           4.5                         3.7
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                         9.8                           7.3
## 2                         9.3                           5.3
## 3                         6.4                           6.1
## 4                        13.3                          10.6
## 5                         8.2                           5.8
## 6                         5.5                            NA
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                         4.4                        12.2                  NA
## 2                         4.4                         6.4                  NA
## 3                         4.7                         8.0                  NA
## 4                         8.5                        13.1                  NA
## 5                         4.5                         7.3                  NA
## 6                          NA                          NA                  NA
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1                NA                NA                                   NA
## 2                NA                NA                                   NA
## 3                NA                NA                                   NA
## 4                NA                NA                                   NA
## 5                NA                NA                                   NA
## 6                NA                NA                                   NA
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                                 NA                                 NA
## 2                                 NA                                 NA
## 3                                 NA                                 NA
## 4                                 NA                                 NA
## 5                                 NA                                 NA
## 6                                 NA                                 NA
##         State_Region Source_UC
## 1        D8 Mountain      ADDM
## 2  D5 South Atlantic      ADDM
## 3  D5 South Atlantic      ADDM
## 4 D2 Middle Atlantic      ADDM
## 5  D5 South Atlantic      ADDM
## 6  D5 South Atlantic      ADDM
##                                                  Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network              Low
##   Prevalence_Risk4 Year_Factor
## 1           Medium        2000
## 2           Medium        2000
## 3           Medium        2000
## 4           Medium        2000
## 5           Medium        2000
## 6              Low        2000
# Filter [ Source: ADDM ], keeping only the two columns needed for SLR:
# Dependent variable: Prevalence
# Independent variable: Year
ASD_State_4_SLR = subset(ASD_State, Source_UC == 'ADDM', select = c(Prevalence, Year))
#
dim(ASD_State_4_SLR)
## [1] 86  2
head(ASD_State_4_SLR)
##   Prevalence Year
## 1        6.5 2000
## 2        6.5 2000
## 3        5.5 2000
## 4        9.9 2000
## 5        6.3 2000
## 6        4.5 2000

SLR Workshop Task: 1. a. Graph the data in a scatterplot to determine if there is a possible linear relationship.

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
plot(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)


SLR Workshop Task: 2. b. Compute and interpret the linear correlation coefficient, r.

Compute the correlation coefficient

cor(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
## [1] 0.7224098

Apply correlation test (two tail: != 0)

cor.test(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
## 
##  Pearson's product-moment correlation
## 
## data:  ASD_State_4_SLR$Year and ASD_State_4_SLR$Prevalence
## t = 9.5753, df = 84, p-value = 4.13e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6027995 0.8102653
## sample estimates:
##       cor 
## 0.7224098

Apply correlation test (one tail: > 0)

cor.test(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence, alternative = "greater")
## 
##  Pearson's product-moment correlation
## 
## data:  ASD_State_4_SLR$Year and ASD_State_4_SLR$Prevalence
## t = 9.5753, df = 84, p-value = 2.065e-15
## alternative hypothesis: true correlation is greater than 0
## 95 percent confidence interval:
##  0.6243611 1.0000000
## sample estimates:
##       cor 
## 0.7224098

SLR Workshop Task: 3. c. Determine the regression equation for the data.

fit_model = lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
print(fit_model)
## 
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
## 
## Coefficients:
## (Intercept)         Year  
##  -1737.8464       0.8714

SLR Workshop Task: 4. d. Graph the regression equation and the data points.

plot(ASD_State_4_SLR$Year, ASD_State_4_SLR$Prevalence)
abline(fit_model, col="blue", lwd=2)


SLR Workshop Task: 5. e. Identify potential influential observations (outliers).

# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
par(mfrow=c(2, 2)) 
plot(fit_model)

par(mfrow=c(1, 1))

[ Tips ] We notice:

  • Based on the Residuals vs Leverage chart, there appear to be no potential influential observations (outliers)
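The visual check can be backed up numerically: observations whose Cook's distance exceeds the common 4/n rule of thumb are often flagged as potentially influential. A sketch (the 4/n cutoff is one convention among several, not a hard rule):

```r
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
cooks_d <- cooks.distance(fit_model)
n <- nrow(ASD_State_4_SLR)
which(cooks_d > 4 / n)  # row indices of potentially influential observations
```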

SLR Workshop Task: 6. f. At the 5% significance level, do the data provide sufficient evidence to conclude that the slope of the population regression line is not 0 and, hence, that [ Year ] is useful as a predictor of ASD [ Prevalence ]?

summary(fit_model)
## 
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9888 -2.7032 -0.0104  1.7397 12.1255 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.738e+03  1.827e+02  -9.513 5.51e-15 ***
## Year         8.714e-01  9.101e-02   9.575 4.13e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.688 on 84 degrees of freedom
## Multiple R-squared:  0.5219, Adjusted R-squared:  0.5162 
## F-statistic: 91.69 on 1 and 84 DF,  p-value: 4.13e-15

[ Tips ] We notice:

  1. The F-test's p-value is 4.13e-15, which is smaller than 0.05, so at the 5% significance level the data provide sufficient evidence that the slope is not 0; hence [ Year ] is useful as a predictor of ASD [ Prevalence ].
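The slope's p-value can also be extracted programmatically instead of read off the printed summary; a sketch:

```r
# Extract the t-test p-value for the Year coefficient from the summary table
coef_table <- coef(summary(fit_model))
coef_table["Year", "Pr(>|t|)"]  # compare against the 0.05 threshold
```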

SLR Workshop Task: 7. g. Obtain the residuals and create a residual plot. Decide whether it is reasonable to consider that the assumptions for regression analysis are met by the variables in questions.

# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
par(mfrow=c(2, 2)) 
plot(fit_model)

par(mfrow=c(1, 1))

[ Tips ] We notice:


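Besides the four built-in diagnostic panels, a plain residuals-vs-fitted plot can be drawn by hand, which also makes the residual vector available for further checks; a sketch:

```r
# Residuals vs fitted values, drawn manually
res <- residuals(fit_model)
plot(fitted(fit_model), res,
     xlab = "Fitted Prevalence", ylab = "Residual")
abline(h = 0, col = "red", lty = 2)  # residuals should scatter evenly around 0
```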
SLR Workshop Task: 8. h. Compute and interpret the coefficient of determination, \(R^2\).

summary(fit_model)
## 
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_SLR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9888 -2.7032 -0.0104  1.7397 12.1255 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.738e+03  1.827e+02  -9.513 5.51e-15 ***
## Year         8.714e-01  9.101e-02   9.575 4.13e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.688 on 84 degrees of freedom
## Multiple R-squared:  0.5219, Adjusted R-squared:  0.5162 
## F-statistic: 91.69 on 1 and 84 DF,  p-value: 4.13e-15

[ Tips ] We notice:

  • \(R^2\) is 0.5219

  • Adjusted \(R^2\) is 0.5162
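Both values can be pulled out of the summary object programmatically; a sketch:

```r
# Extract R-squared and adjusted R-squared from the summary object
s <- summary(fit_model)
s$r.squared       # proportion of variance in Prevalence explained by Year
s$adj.r.squared   # same, penalized for the number of predictors
```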


SLR Workshop Task: 9. i. Find the predicted ASD Prevalence of future Year.

future_year = 2025
newdata = data.frame(Year = future_year) 
predict(fit_model,newdata)
##        1 
## 26.75998
#
cat("Predicted ASD Prevalence of Year [", future_year, "] is", round(predict(fit_model,newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence of Year [ 2025 ] is 26.8 per 1,000 Children

SLR Workshop Task: 10. j. Determine a 95% confidence interval for the predicted ASD Prevalence.

predict(fit_model, newdata, interval = "prediction")
##        fit      lwr      upr
## 1 26.75998 18.72351 34.79645
cat("\nPredicted ASD Prevalence of Year [", future_year, "] (95% Upper CI) is", 
    round(predict(fit_model, newdata, interval = "prediction")[3], 1), "per 1,000 Children")
## 
## Predicted ASD Prevalence of Year [ 2025 ] (95% Upper CI) is 34.8 per 1,000 Children
cat("\nPredicted ASD Prevalence of Year [", future_year, "] (95% Lower CI) is", 
    round(predict(fit_model, newdata, interval = "prediction")[2], 1), "per 1,000 Children")
## 
## Predicted ASD Prevalence of Year [ 2025 ] (95% Lower CI) is 18.7 per 1,000 Children
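Note that a prediction interval covers a single future observation; the narrower interval for the mean response at that Year is obtained with `interval = "confidence"`. A sketch comparing the two:

```r
# Confidence interval for the MEAN Prevalence at the future year
predict(fit_model, newdata, interval = "confidence")
# Prediction interval for an INDIVIDUAL future observation (always wider)
predict(fit_model, newdata, interval = "prediction")
```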

<h3>
    Quiz:
</h3>
<p>
    Create Prevalence ~ Year SLR model for Data Source: SPED
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.


Linear Model: Multiple Linear Regression (MLR)

<h3>
Linear Model: Multiple Linear Regression (MLR) - Workshop Task
</h3>

Workshop Task:

    1. Get the data.
    2. Discover and visualize the data to gain insights (Is there any missing value in the dataframe? If so, how do we deal with it?)
    3. Visualize data and trends.
    4. Compute correlation between variables and apply multiple regression.
    5. Check for multicollinearity, then how to remove it.
    6. What does your final model look like?
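For the multicollinearity check, variance inflation factors (VIF) are a common tool. A sketch using the `car` package (assumptions: `car` must be installed, and the predictors shown are illustrative only, not the final model):

```r
# Variance inflation factors: values well above ~5-10 suggest multicollinearity
if (!require(car)) install.packages("car")
library(car)
# Illustrative model; swap in the predictors you actually keep
fit_mlr <- lm(Prevalence ~ Year + Denominator + Male.Prevalence, data = ASD_State)
vif(fit_mlr)  # one VIF per predictor
```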

MLR Workshop Task: 1. a. Get the data.

Use Case Data: “../dataset/ADV_ASD_State_R.csv”

Read in CSV data, storing as R dataframe

# Read back in above saved file:
# ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
# ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
##   State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1    AZ       45322        6.5      5.8      7.3 2000   addm
## 2    GA       43593        6.5      5.8      7.3 2000   addm
## 3    MD       21532        5.5      4.6      6.6 2000   addm
## 4    NJ       29714        9.9      8.9     11.1 2000   addm
## 5    SC       24535        6.3      5.4      7.4 2000   addm
## 6    WV       23065        4.5      3.7      5.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2 Numerator_ASD Numerator_NonASD  Proportion
## 1        AZ-Arizona           295            45027 0.006508980
## 2        GA-Georgia           283            43310 0.006491868
## 3       MD-Maryland           118            21414 0.005480215
## 4     NJ-New Jersey           294            29420 0.009894326
## 5 SC-South Carolina           155            24380 0.006317506
## 6  WV-West Virginia           104            22961 0.004508996
##   Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1                      5.798905                      7.303948             9.7
## 2                      5.769431                      7.302595            11.0
## 3                      4.557351                      6.583638             8.6
## 4                      8.814705                     11.102544            14.8
## 5                      5.381662                      7.411085             9.3
## 6                      3.703408                      5.483723             6.6
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1           8.5          11.1               3.2             2.5             4.0
## 2           9.7          12.4               2.0             1.5             2.7
## 3           7.1          10.6               2.2             1.5             2.7
## 4          13.0          16.8               4.3             3.3             5.5
## 5           7.8          11.2               3.3             2.4             4.5
## 6           5.2           8.2               2.4             1.6             3.5
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                           8.6                         7.5
## 2                           7.9                         6.7
## 3                           4.9                         3.8
## 4                          11.3                         9.5
## 5                           6.5                         5.2
## 6                           4.5                         3.7
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                         9.8                           7.3
## 2                         9.3                           5.3
## 3                         6.4                           6.1
## 4                        13.3                          10.6
## 5                         8.2                           5.8
## 6                         5.5                            NA
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                         4.4                        12.2                  NA
## 2                         4.4                         6.4                  NA
## 3                         4.7                         8.0                  NA
## 4                         8.5                        13.1                  NA
## 5                         4.5                         7.3                  NA
## 6                          NA                          NA                  NA
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1                NA                NA                                   NA
## 2                NA                NA                                   NA
## 3                NA                NA                                   NA
## 4                NA                NA                                   NA
## 5                NA                NA                                   NA
## 6                NA                NA                                   NA
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                                 NA                                 NA
## 2                                 NA                                 NA
## 3                                 NA                                 NA
## 4                                 NA                                 NA
## 5                                 NA                                 NA
## 6                                 NA                                 NA
##         State_Region Source_UC
## 1        D8 Mountain      ADDM
## 2  D5 South Atlantic      ADDM
## 3  D5 South Atlantic      ADDM
## 4 D2 Middle Atlantic      ADDM
## 5  D5 South Atlantic      ADDM
## 6  D5 South Atlantic      ADDM
##                                                  Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network              Low
##   Prevalence_Risk4 Year_Factor
## 1           Medium        2000
## 2           Medium        2000
## 3           Medium        2000
## 4           Medium        2000
## 5           Medium        2000
## 6              Low        2000
names(ASD_State)
##  [1] "State"                               
##  [2] "Denominator"                         
##  [3] "Prevalence"                          
##  [4] "Lower.CI"                            
##  [5] "Upper.CI"                            
##  [6] "Year"                                
##  [7] "Source"                              
##  [8] "Source_Full1"                        
##  [9] "State_Full1"                         
## [10] "State_Full2"                         
## [11] "Numerator_ASD"                       
## [12] "Numerator_NonASD"                    
## [13] "Proportion"                          
## [14] "Chi_Wilson_Corrected_Lower.CI"       
## [15] "Chi_Wilson_Corrected_Upper.CI"       
## [16] "Male.Prevalence"                     
## [17] "Male.Lower.CI"                       
## [18] "Male.Upper.CI"                       
## [19] "Female.Prevalence"                   
## [20] "Female.Lower.CI"                     
## [21] "Female.Upper.CI"                     
## [22] "Non.hispanic.white.Prevalence"       
## [23] "Non.hispanic.white.Lower.CI"         
## [24] "Non.hispanic.white.Upper.CI"         
## [25] "Non.hispanic.black.Prevalence"       
## [26] "Non.hispanic.black.Lower.CI"         
## [27] "Non.hispanic.black.Upper.CI"         
## [28] "Hispanic.Prevalence"                 
## [29] "Hispanic.Lower.CI"                   
## [30] "Hispanic.Upper.CI"                   
## [31] "Asian.or.Pacific.Islander.Prevalence"
## [32] "Asian.or.Pacific.Islander.Lower.CI"  
## [33] "Asian.or.Pacific.Islander.Upper.CI"  
## [34] "State_Region"                        
## [35] "Source_UC"                           
## [36] "Source_Full3"                        
## [37] "Prevalence_Risk2"                    
## [38] "Prevalence_Risk4"                    
## [39] "Year_Factor"
# Filter to include relevant columns for MLR:
# Dependent variable: Prevalence
# Independent variables: let's include all for the moment
ASD_State_4_MLR = ASD_State
#
dim(ASD_State_4_MLR)
## [1] 1692   39
head(ASD_State_4_MLR)
##   State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1    AZ       45322        6.5      5.8      7.3 2000   addm
## 2    GA       43593        6.5      5.8      7.3 2000   addm
## 3    MD       21532        5.5      4.6      6.6 2000   addm
## 4    NJ       29714        9.9      8.9     11.1 2000   addm
## 5    SC       24535        6.3      5.4      7.4 2000   addm
## 6    WV       23065        4.5      3.7      5.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2 Numerator_ASD Numerator_NonASD  Proportion
## 1        AZ-Arizona           295            45027 0.006508980
## 2        GA-Georgia           283            43310 0.006491868
## 3       MD-Maryland           118            21414 0.005480215
## 4     NJ-New Jersey           294            29420 0.009894326
## 5 SC-South Carolina           155            24380 0.006317506
## 6  WV-West Virginia           104            22961 0.004508996
##   Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1                      5.798905                      7.303948             9.7
## 2                      5.769431                      7.302595            11.0
## 3                      4.557351                      6.583638             8.6
## 4                      8.814705                     11.102544            14.8
## 5                      5.381662                      7.411085             9.3
## 6                      3.703408                      5.483723             6.6
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1           8.5          11.1               3.2             2.5             4.0
## 2           9.7          12.4               2.0             1.5             2.7
## 3           7.1          10.6               2.2             1.5             2.7
## 4          13.0          16.8               4.3             3.3             5.5
## 5           7.8          11.2               3.3             2.4             4.5
## 6           5.2           8.2               2.4             1.6             3.5
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                           8.6                         7.5
## 2                           7.9                         6.7
## 3                           4.9                         3.8
## 4                          11.3                         9.5
## 5                           6.5                         5.2
## 6                           4.5                         3.7
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                         9.8                           7.3
## 2                         9.3                           5.3
## 3                         6.4                           6.1
## 4                        13.3                          10.6
## 5                         8.2                           5.8
## 6                         5.5                            NA
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                         4.4                        12.2                  NA
## 2                         4.4                         6.4                  NA
## 3                         4.7                         8.0                  NA
## 4                         8.5                        13.1                  NA
## 5                         4.5                         7.3                  NA
## 6                          NA                          NA                  NA
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1                NA                NA                                   NA
## 2                NA                NA                                   NA
## 3                NA                NA                                   NA
## 4                NA                NA                                   NA
## 5                NA                NA                                   NA
## 6                NA                NA                                   NA
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                                 NA                                 NA
## 2                                 NA                                 NA
## 3                                 NA                                 NA
## 4                                 NA                                 NA
## 5                                 NA                                 NA
## 6                                 NA                                 NA
##         State_Region Source_UC
## 1        D8 Mountain      ADDM
## 2  D5 South Atlantic      ADDM
## 3  D5 South Atlantic      ADDM
## 4 D2 Middle Atlantic      ADDM
## 5  D5 South Atlantic      ADDM
## 6  D5 South Atlantic      ADDM
##                                                  Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network              Low
##   Prevalence_Risk4 Year_Factor
## 1           Medium        2000
## 2           Medium        2000
## 3           Medium        2000
## 4           Medium        2000
## 5           Medium        2000
## 6              Low        2000

MLR Workshop Task: 2. b. Discover and visualize the data to gain insights (Is there any missing value in the dataframe? If so, how do we deal with it?).

summary(ASD_State_4_MLR)
##      State       Denominator        Prevalence        Lower.CI    
##  AZ     :  40   Min.   :    965   Min.   : 0.400   Min.   : 0.30  
##  MD     :  40   1st Qu.: 107151   1st Qu.: 3.100   1st Qu.: 2.80  
##  GA     :  39   Median : 353328   Median : 5.600   Median : 5.30  
##  MO     :  39   Mean   : 604689   Mean   : 7.191   Mean   : 6.42  
##  NC     :  39   3rd Qu.: 767928   3rd Qu.: 9.200   3rd Qu.: 8.60  
##  WI     :  39   Max.   :5824922   Max.   :42.700   Max.   :29.90  
##  (Other):1456                                                     
##     Upper.CI           Year       Source   
##  Min.   : 0.600   Min.   :2000   addm: 86  
##  1st Qu.: 3.300   1st Qu.:2003   medi:655  
##  Median : 5.900   Median :2007   nsch: 98  
##  Mean   : 8.262   Mean   :2007   sped:853  
##  3rd Qu.: 9.700   3rd Qu.:2011             
##  Max.   :69.000   Max.   :2016             
##                                            
##                                                  Source_Full1
##  Autism & Developmental Disabilities Monitoring Network: 86  
##  Medicaid                                              :655  
##  National Survey of Children's Health                  : 98  
##  Special Education Child Count                         :853  
##                                                              
##                                                              
##                                                              
##          State_Full1              State_Full2   Numerator_ASD    
##  Arizona       :  40   AZ-Arizona       :  40   Min.   :   11.0  
##  Maryland      :  40   MD-Maryland      :  40   1st Qu.:  463.8  
##  Georgia       :  39   GA-Georgia       :  39   Median : 1478.5  
##  Missouri      :  39   MO-Missouri      :  39   Mean   : 3668.3  
##  North Carolina:  39   NC-North Carolina:  39   3rd Qu.: 4147.5  
##  Wisconsin     :  39   WI-Wisconsin     :  39   Max.   :79041.0  
##  (Other)       :1456   (Other)          :1456                    
##  Numerator_NonASD    Proportion        Chi_Wilson_Corrected_Lower.CI
##  Min.   :    947   Min.   :0.0004079   Min.   : 0.2647              
##  1st Qu.: 106377   1st Qu.:0.0030973   1st Qu.: 2.8266              
##  Median : 352088   Median :0.0055998   Median : 5.2822              
##  Mean   : 601021   Mean   :0.0071888   Mean   : 6.5008              
##  3rd Qu.: 761045   3rd Qu.:0.0092065   3rd Qu.: 8.6202              
##  Max.   :5795215   Max.   :0.0424597   Max.   :32.6669              
##                                                                     
##  Chi_Wilson_Corrected_Upper.CI Male.Prevalence Male.Lower.CI    Male.Upper.CI  
##  Min.   : 0.5491               Min.   : 5.00   Min.   : 4.100   Min.   : 6.20  
##  1st Qu.: 3.3186               1st Qu.:11.30   1st Qu.: 9.475   1st Qu.:13.12  
##  Median : 5.9473               Median :16.90   Median :14.800   Median :19.05  
##  Mean   : 8.0617               Mean   :18.28   Mean   :16.035   Mean   :20.90  
##  3rd Qu.: 9.7141               3rd Qu.:23.93   3rd Qu.:21.750   3rd Qu.:26.50  
##  Max.   :55.4426               Max.   :45.50   Max.   :42.400   Max.   :48.90  
##                                NA's   :1606    NA's   :1606     NA's   :1606   
##  Female.Prevalence Female.Lower.CI  Female.Upper.CI 
##  Min.   : 1.000    Min.   : 0.600   Min.   : 1.700  
##  1st Qu.: 2.825    1st Qu.: 1.900   1st Qu.: 3.825  
##  Median : 3.700    Median : 2.800   Median : 5.200  
##  Mean   : 4.248    Mean   : 3.229   Mean   : 5.646  
##  3rd Qu.: 5.575    3rd Qu.: 4.475   3rd Qu.: 6.875  
##  Max.   :12.300    Max.   :10.700   Max.   :20.100  
##  NA's   :1606      NA's   :1606     NA's   :1606    
##  Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
##  Min.   : 3.30                 Min.   : 2.300             
##  1st Qu.: 7.45                 1st Qu.: 6.275             
##  Median :12.00                 Median :10.250             
##  Mean   :12.43                 Mean   :10.602             
##  3rd Qu.:16.18                 3rd Qu.:14.250             
##  Max.   :40.00                 Max.   :28.900             
##  NA's   :1606                  NA's   :1606               
##  Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
##  Min.   : 4.100              Min.   : 1.60                
##  1st Qu.: 9.075              1st Qu.: 5.80                
##  Median :13.650              Median : 9.00                
##  Mean   :14.638              Mean   :10.19                
##  3rd Qu.:19.025              3rd Qu.:12.80                
##  Max.   :55.500              Max.   :27.20                
##  NA's   :1606                NA's   :1607                 
##  Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
##  Min.   : 0.900              Min.   : 3.0                Min.   : 0.300     
##  1st Qu.: 4.100              1st Qu.: 8.0                1st Qu.: 4.500     
##  Median : 6.000              Median :12.9                Median : 7.000     
##  Mean   : 7.521              Mean   :14.6                Mean   : 7.891     
##  3rd Qu.: 9.800              3rd Qu.:18.4                3rd Qu.:10.100     
##  Max.   :23.300              Max.   :80.2                Max.   :29.300     
##  NA's   :1607                NA's   :1607                NA's   :1615       
##  Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
##  Min.   : 0.000    Min.   : 2.10     Min.   : 1.000                      
##  1st Qu.: 2.300    1st Qu.: 9.00     1st Qu.: 4.450                      
##  Median : 4.500    Median :11.40     Median : 8.150                      
##  Mean   : 5.462    Mean   :12.59     Mean   : 9.127                      
##  3rd Qu.: 7.500    3rd Qu.:14.60     3rd Qu.:12.350                      
##  Max.   :26.200    Max.   :32.90     Max.   :21.900                      
##  NA's   :1615      NA's   :1615      NA's   :1624                        
##  Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
##  Min.   : 0.200                     Min.   : 7.40                     
##  1st Qu.: 1.475                     1st Qu.:13.28                     
##  Median : 4.700                     Median :18.30                     
##  Mean   : 5.235                     Mean   :18.41                     
##  3rd Qu.: 7.875                     3rd Qu.:22.73                     
##  Max.   :16.000                     Max.   :32.00                     
##  NA's   :1624                       NA's   :1624                      
##                 State_Region Source_UC 
##  D5 South Atlantic    :313   ADDM: 86  
##  D8 Mountain          :265   MEDI:655  
##  D4 West North Central:228   NSCH: 98  
##  D1 New England       :189   SPED:853  
##  D3 East North Central:167             
##  D9 Pacific           :160             
##  (Other)              :370             
##                                                       Source_Full3
##  ADDM Autism & Developmental Disabilities Monitoring Network: 86  
##  MEDI Medicaid                                              :655  
##  NSCH National Survey of Children's Health                  : 98  
##  SPED Special Education Child Count                         :853  
##                                                                   
##                                                                   
##                                                                   
##  Prevalence_Risk2  Prevalence_Risk4  Year_Factor 
##  Low :740         Low      :740     2012   :154  
##  High:952         Medium   :584     2008   :127  
##                   High     :294     2002   :116  
##                   Very High: 74     2004   :111  
##                                     2010   :111  
##                                     2006   :110  
##                                     (Other):963
# Check whether each column contains missing values:
lapply(ASD_State_4_MLR, function(col_x)sum(is.na(col_x)))
## $State
## [1] 0
## 
## $Denominator
## [1] 0
## 
## $Prevalence
## [1] 0
## 
## $Lower.CI
## [1] 0
## 
## $Upper.CI
## [1] 0
## 
## $Year
## [1] 0
## 
## $Source
## [1] 0
## 
## $Source_Full1
## [1] 0
## 
## $State_Full1
## [1] 0
## 
## $State_Full2
## [1] 0
## 
## $Numerator_ASD
## [1] 0
## 
## $Numerator_NonASD
## [1] 0
## 
## $Proportion
## [1] 0
## 
## $Chi_Wilson_Corrected_Lower.CI
## [1] 0
## 
## $Chi_Wilson_Corrected_Upper.CI
## [1] 0
## 
## $Male.Prevalence
## [1] 1606
## 
## $Male.Lower.CI
## [1] 1606
## 
## $Male.Upper.CI
## [1] 1606
## 
## $Female.Prevalence
## [1] 1606
## 
## $Female.Lower.CI
## [1] 1606
## 
## $Female.Upper.CI
## [1] 1606
## 
## $Non.hispanic.white.Prevalence
## [1] 1606
## 
## $Non.hispanic.white.Lower.CI
## [1] 1606
## 
## $Non.hispanic.white.Upper.CI
## [1] 1606
## 
## $Non.hispanic.black.Prevalence
## [1] 1607
## 
## $Non.hispanic.black.Lower.CI
## [1] 1607
## 
## $Non.hispanic.black.Upper.CI
## [1] 1607
## 
## $Hispanic.Prevalence
## [1] 1615
## 
## $Hispanic.Lower.CI
## [1] 1615
## 
## $Hispanic.Upper.CI
## [1] 1615
## 
## $Asian.or.Pacific.Islander.Prevalence
## [1] 1624
## 
## $Asian.or.Pacific.Islander.Lower.CI
## [1] 1624
## 
## $Asian.or.Pacific.Islander.Upper.CI
## [1] 1624
## 
## $State_Region
## [1] 0
## 
## $Source_UC
## [1] 0
## 
## $Source_Full3
## [1] 0
## 
## $Prevalence_Risk2
## [1] 0
## 
## $Prevalence_Risk4
## [1] 0
## 
## $Year_Factor
## [1] 0
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
barplot(apply(ASD_State_4_MLR, 2, function(col_x)sum(is.na(col_x))))

dim(ASD_State_4_MLR)
## [1] 1692   39
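Dropping the NA-heavy columns (done below) is one option; imputation is another. As a hedged sketch of the alternative, the helper below fills numeric NAs with the column median. With 1,606 or more of the 1,692 rows missing in these columns, imputed values would dominate the real ones, so removal is the better choice here:

```r
# Sketch only: median imputation as an alternative to dropping columns.
# Not advisable here, because ~95% of the values in the NA columns are
# missing -- the imputed values would outnumber the observed ones.
impute_median <- function(col_x) {
  if (is.numeric(col_x) && anyNA(col_x)) {
    col_x[is.na(col_x)] <- median(col_x, na.rm = TRUE)
  }
  col_x
}
# ASD_imputed <- as.data.frame(lapply(ASD_State_4_MLR, impute_median))
```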
# Get the names of all columns that contain missing values
NA_Column_Names <- names(ASD_State_4_MLR[0, colSums(is.na(ASD_State_4_MLR)) > 0])
#
NA_Column_Names
##  [1] "Male.Prevalence"                     
##  [2] "Male.Lower.CI"                       
##  [3] "Male.Upper.CI"                       
##  [4] "Female.Prevalence"                   
##  [5] "Female.Lower.CI"                     
##  [6] "Female.Upper.CI"                     
##  [7] "Non.hispanic.white.Prevalence"       
##  [8] "Non.hispanic.white.Lower.CI"         
##  [9] "Non.hispanic.white.Upper.CI"         
## [10] "Non.hispanic.black.Prevalence"       
## [11] "Non.hispanic.black.Lower.CI"         
## [12] "Non.hispanic.black.Upper.CI"         
## [13] "Hispanic.Prevalence"                 
## [14] "Hispanic.Lower.CI"                   
## [15] "Hispanic.Upper.CI"                   
## [16] "Asian.or.Pacific.Islander.Prevalence"
## [17] "Asian.or.Pacific.Islander.Lower.CI"  
## [18] "Asian.or.Pacific.Islander.Upper.CI"
# Remove these columns from dataframe
ASD_State_4_MLR <- ASD_State_4_MLR[ , !(names(ASD_State_4_MLR) %in% NA_Column_Names)]
#
head(ASD_State_4_MLR)
##   State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1    AZ       45322        6.5      5.8      7.3 2000   addm
## 2    GA       43593        6.5      5.8      7.3 2000   addm
## 3    MD       21532        5.5      4.6      6.6 2000   addm
## 4    NJ       29714        9.9      8.9     11.1 2000   addm
## 5    SC       24535        6.3      5.4      7.4 2000   addm
## 6    WV       23065        4.5      3.7      5.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2 Numerator_ASD Numerator_NonASD  Proportion
## 1        AZ-Arizona           295            45027 0.006508980
## 2        GA-Georgia           283            43310 0.006491868
## 3       MD-Maryland           118            21414 0.005480215
## 4     NJ-New Jersey           294            29420 0.009894326
## 5 SC-South Carolina           155            24380 0.006317506
## 6  WV-West Virginia           104            22961 0.004508996
##   Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI
## 1                      5.798905                      7.303948
## 2                      5.769431                      7.302595
## 3                      4.557351                      6.583638
## 4                      8.814705                     11.102544
## 5                      5.381662                      7.411085
## 6                      3.703408                      5.483723
##         State_Region Source_UC
## 1        D8 Mountain      ADDM
## 2  D5 South Atlantic      ADDM
## 3  D5 South Atlantic      ADDM
## 4 D2 Middle Atlantic      ADDM
## 5  D5 South Atlantic      ADDM
## 6  D5 South Atlantic      ADDM
##                                                  Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network              Low
##   Prevalence_Risk4 Year_Factor
## 1           Medium        2000
## 2           Medium        2000
## 3           Medium        2000
## 4           Medium        2000
## 5           Medium        2000
## 6              Low        2000

No more missing values: the columns that contained them have been removed. Hurrah!

However, some variables contain "leaky" information that can be used to calculate the dependent variable (Prevalence) directly. Such information would not be available in a real-world prediction scenario, so these variables must be removed.
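To see the leakage concretely, here is a quick check (a sketch only, assuming Prevalence is reported per 1,000 children, which the head() output above suggests):

```r
# Leakage check (assumption: Prevalence is reported per 1,000 children).
# Proportion * 1000, rounded to one decimal, should reproduce Prevalence:
# e.g. for the first row, round(295 / 45322 * 1000, 1) gives 6.5.
head(round(ASD_State_4_MLR$Proportion * 1000, 1))
head(ASD_State_4_MLR$Prevalence)
# Likewise Lower.CI and Upper.CI bracket Prevalence by construction.
```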

cbind(names(ASD_State_4_MLR), c(1:length(names(ASD_State_4_MLR))))
##       [,1]                            [,2]
##  [1,] "State"                         "1" 
##  [2,] "Denominator"                   "2" 
##  [3,] "Prevalence"                    "3" 
##  [4,] "Lower.CI"                      "4" 
##  [5,] "Upper.CI"                      "5" 
##  [6,] "Year"                          "6" 
##  [7,] "Source"                        "7" 
##  [8,] "Source_Full1"                  "8" 
##  [9,] "State_Full1"                   "9" 
## [10,] "State_Full2"                   "10"
## [11,] "Numerator_ASD"                 "11"
## [12,] "Numerator_NonASD"              "12"
## [13,] "Proportion"                    "13"
## [14,] "Chi_Wilson_Corrected_Lower.CI" "14"
## [15,] "Chi_Wilson_Corrected_Upper.CI" "15"
## [16,] "State_Region"                  "16"
## [17,] "Source_UC"                     "17"
## [18,] "Source_Full3"                  "18"
## [19,] "Prevalence_Risk2"              "19"
## [20,] "Prevalence_Risk4"              "20"
## [21,] "Year_Factor"                   "21"
Leaky_Column_Names = c('Lower.CI', 'Upper.CI', 'Numerator_ASD', 'Numerator_NonASD', 'Proportion', 
                       'Chi_Wilson_Corrected_Lower.CI', 'Chi_Wilson_Corrected_Upper.CI', 
                       'Prevalence_Risk2', 'Prevalence_Risk4')
# Remove these columns from dataframe
ASD_State_4_MLR <- ASD_State_4_MLR[ , !(names(ASD_State_4_MLR) %in% Leaky_Column_Names)]
#
head(ASD_State_4_MLR)
##   State Denominator Prevalence Year Source
## 1    AZ       45322        6.5 2000   addm
## 2    GA       43593        6.5 2000   addm
## 3    MD       21532        5.5 2000   addm
## 4    NJ       29714        9.9 2000   addm
## 5    SC       24535        6.3 2000   addm
## 6    WV       23065        4.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2       State_Region Source_UC
## 1        AZ-Arizona        D8 Mountain      ADDM
## 2        GA-Georgia  D5 South Atlantic      ADDM
## 3       MD-Maryland  D5 South Atlantic      ADDM
## 4     NJ-New Jersey D2 Middle Atlantic      ADDM
## 5 SC-South Carolina  D5 South Atlantic      ADDM
## 6  WV-West Virginia  D5 South Atlantic      ADDM
##                                                  Source_Full3 Year_Factor
## 1 ADDM Autism & Developmental Disabilities Monitoring Network        2000
## 2 ADDM Autism & Developmental Disabilities Monitoring Network        2000
## 3 ADDM Autism & Developmental Disabilities Monitoring Network        2000
## 4 ADDM Autism & Developmental Disabilities Monitoring Network        2000
## 5 ADDM Autism & Developmental Disabilities Monitoring Network        2000
## 6 ADDM Autism & Developmental Disabilities Monitoring Network        2000

Remove redundant/duplicate variables (which would produce aliased coefficients); retaining one variable per type of information is enough:

https://en.wikipedia.org/wiki/Multicollinearity

https://stats.stackexchange.com/questions/112442/what-are-aliased-coefficients
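Aliased coefficients can also be detected directly. As a sketch, fit a model containing two duplicate factors and ask alias() which terms are exact linear combinations of others:

```r
# Sketch: State and State_Full2 encode the same information, so one set
# of coefficients is aliased (perfectly collinear) with the other.
# alias() lists the dependencies; lm() would report NA for such terms.
aliased <- alias(lm(Prevalence ~ State + State_Full2, data = ASD_State_4_MLR))
# aliased$Complete is non-NULL when complete aliasing is present.
```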

Redundant_Column_Names = c('State', 'Source_Full1', 'State_Full1', 'State_Region', 'Source_UC', 'Source_Full3', 'Year_Factor')
# Remove these columns from dataframe
ASD_State_4_MLR <- ASD_State_4_MLR[ , !(names(ASD_State_4_MLR) %in% Redundant_Column_Names)]
#
head(ASD_State_4_MLR)
##   Denominator Prevalence Year Source       State_Full2
## 1       45322        6.5 2000   addm        AZ-Arizona
## 2       43593        6.5 2000   addm        GA-Georgia
## 3       21532        5.5 2000   addm       MD-Maryland
## 4       29714        9.9 2000   addm     NJ-New Jersey
## 5       24535        6.3 2000   addm SC-South Carolina
## 6       23065        4.5 2000   addm  WV-West Virginia

MLR Workshop Task: 3. c. Visualize the data to gain insights

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
plot(ASD_State_4_MLR$Year, ASD_State_4_MLR$Prevalence)

plot(as.factor(ASD_State_4_MLR$Year), ASD_State_4_MLR$Prevalence)

plot(ASD_State_4_MLR$Denominator, ASD_State_4_MLR$Prevalence)

# To use bin() function
# https://www.rdocumentation.org/packages/OneR/versions/2.2/topics/bin
if(!require(OneR)){install.packages("OneR")}
## Loading required package: OneR
library('OneR')

# Bin 'Denominator'
plot(bin(ASD_State_4_MLR$Denominator, nbins = 10), ASD_State_4_MLR$Prevalence)

plot(ASD_State_4_MLR$Source, ASD_State_4_MLR$Prevalence)

plot(ASD_State_4_MLR$State_Full2, ASD_State_4_MLR$Prevalence)


MLR Workshop Task: 4. d. Compute correlation between variables and apply multiple regression.

Recode categorical variables to dummy (numeric) variables using one-hot encoding:
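Note that lm() performs this dummy coding automatically through model.matrix(). The sketch below shows what the encoding looks like for the Source factor (the first level, addm, becomes the baseline and is absorbed into the intercept):

```r
# Treatment (dummy) coding of the 4-level Source factor:
head(model.matrix(~ Source, data = ASD_State_4_MLR))
# For full one-hot encoding (no baseline dropped), remove the intercept:
# head(model.matrix(~ Source - 1, data = ASD_State_4_MLR))
```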

# To use select_if() function
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("dplyr")

summary(select_if(ASD_State_4_MLR, is.numeric))
##   Denominator        Prevalence          Year     
##  Min.   :    965   Min.   : 0.400   Min.   :2000  
##  1st Qu.: 107151   1st Qu.: 3.100   1st Qu.:2003  
##  Median : 353328   Median : 5.600   Median :2007  
##  Mean   : 604689   Mean   : 7.191   Mean   :2007  
##  3rd Qu.: 767928   3rd Qu.: 9.200   3rd Qu.:2011  
##  Max.   :5824922   Max.   :42.700   Max.   :2016
correlation = cor(select_if(ASD_State_4_MLR, is.numeric))
correlation
##             Denominator Prevalence       Year
## Denominator  1.00000000 -0.1374662 0.02851671
## Prevalence  -0.13746621  1.0000000 0.64002950
## Year         0.02851671  0.6400295 1.00000000
# Each variable's correlation with the target (dependent) variable:
correlation[, 2]
## Denominator  Prevalence        Year 
##  -0.1374662   1.0000000   0.6400295
if(!require(corrplot)){install.packages("corrplot")}
## Loading required package: corrplot
## corrplot 0.84 loaded
library("corrplot")
corrplot(correlation, tl.col="black", tl.pos = "lt")

str(ASD_State_4_MLR)
## 'data.frame':    1692 obs. of  5 variables:
##  $ Denominator: int  45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
##  $ Prevalence : num  6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
##  $ Year       : int  2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
##  $ Source     : Factor w/ 4 levels "addm","medi",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ State_Full2: Factor w/ 51 levels "AK-Alaska","AL-Alabama",..: 4 11 21 32 41 50 2 4 3 6 ...
# To build a (national-level) ASD Prevalence predictive model across all states:
# When the US state name is unknown, we can fit a model without the state name/code:
fit_model = lm(Prevalence ~ . - State_Full2, data = ASD_State_4_MLR) # Exclude a variable using "- variable"
#
summary(fit_model)
## 
## Call:
## lm(formula = Prevalence ~ . - State_Full2, data = ASD_State_4_MLR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4476  -1.9332  -0.2786   1.2479  21.0130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.310e+03  3.819e+01 -34.306   <2e-16 ***
## Denominator -1.139e-07  1.057e-07  -1.078    0.281    
## Year         6.583e-01  1.902e-02  34.606   <2e-16 ***
## Sourcemedi  -4.550e+00  3.964e-01 -11.478   <2e-16 ***
## Sourcensch   6.699e+00  5.171e-01  12.956   <2e-16 ***
## Sourcesped  -5.611e+00  3.984e-01 -14.085   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 1686 degrees of freedom
## Multiple R-squared:  0.6585, Adjusted R-squared:  0.6575 
## F-statistic: 650.2 on 5 and 1686 DF,  p-value: < 2.2e-16
<p>
    Adjusted $R^2$ = 0.6575
</p>
# To build a (US state-level) ASD Prevalence predictive model for a specific state:
# Used when the US state name is known. (A state name is required at prediction time.)
fit_model = lm(Prevalence ~ . , data = ASD_State_4_MLR) # "~ ." means all other variables, including factors
#
summary(fit_model)
## 
## Call:
## lm(formula = Prevalence ~ ., data = ASD_State_4_MLR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3326  -1.3626  -0.0689   1.2558  19.0273 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.305e+03  3.144e+01 -41.498  < 2e-16 ***
## Denominator                         1.152e-06  2.252e-07   5.115 3.50e-07 ***
## Year                                6.558e-01  1.566e-02  41.889  < 2e-16 ***
## Sourcemedi                         -4.997e+00  3.557e-01 -14.048  < 2e-16 ***
## Sourcensch                          6.100e+00  4.404e-01  13.853  < 2e-16 ***
## Sourcesped                         -6.699e+00  3.915e-01 -17.110  < 2e-16 ***
## State_Full2AL-Alabama               1.377e+00  6.917e-01   1.990 0.046725 *  
## State_Full2AR-Arkansas             -8.466e-01  6.838e-01  -1.238 0.215831    
## State_Full2AZ-Arizona               1.592e-01  6.841e-01   0.233 0.816057    
## State_Full2CA-California           -5.092e+00  1.188e+00  -4.284 1.94e-05 ***
## State_Full2CO-Colorado             -2.565e+00  6.875e-01  -3.732 0.000197 ***
## State_Full2CT-Connecticut           3.031e-01  7.004e-01   0.433 0.665205    
## State_Full2DC-District of Columbia -1.383e+00  7.510e-01  -1.841 0.065799 .  
## State_Full2DE-Delaware              5.313e-01  7.030e-01   0.756 0.449933    
## State_Full2FL-Florida              -2.829e+00  7.793e-01  -3.630 0.000292 ***
## State_Full2GA-Georgia              -1.024e+00  7.045e-01  -1.454 0.146273    
## State_Full2HI-Hawaii               -2.149e+00  7.087e-01  -3.032 0.002467 ** 
## State_Full2IA-Iowa                 -2.160e+00  7.046e-01  -3.066 0.002206 ** 
## State_Full2ID-Idaho                 2.484e+00  7.034e-01   3.532 0.000424 ***
## State_Full2IL-Illinois             -1.648e+00  7.613e-01  -2.164 0.030575 *  
## State_Full2IN-Indiana               1.420e+00  7.103e-01   2.000 0.045713 *  
## State_Full2KS-Kansas               -8.059e-01  7.159e-01  -1.126 0.260465    
## State_Full2KY-Kentucky             -8.268e-01  7.077e-01  -1.168 0.242887    
## State_Full2LA-Louisiana            -2.557e+00  7.104e-01  -3.600 0.000328 ***
## State_Full2MA-Massachusetts         7.401e-01  7.074e-01   1.046 0.295625    
## State_Full2MD-Maryland              2.673e-01  6.792e-01   0.394 0.693986    
## State_Full2ME-Maine                 6.435e+00  7.353e-01   8.752  < 2e-16 ***
## State_Full2MI-Michigan              3.667e-01  7.380e-01   0.497 0.619395    
## State_Full2MN-Minnesota             6.592e+00  6.997e-01   9.422  < 2e-16 ***
## State_Full2MO-Missouri              1.443e-01  6.842e-01   0.211 0.832955    
## State_Full2MS-Mississippi          -2.222e+00  7.170e-01  -3.099 0.001972 ** 
## State_Full2MT-Montana               1.431e+00  7.210e-01   1.984 0.047367 *  
## State_Full2NC-North Carolina       -2.705e-01  6.979e-01  -0.388 0.698328    
## State_Full2ND-North Dakota          7.056e-01  7.088e-01   0.996 0.319629    
## State_Full2NE-Nebraska             -1.845e+00  7.090e-01  -2.602 0.009349 ** 
## State_Full2NH-New Hampshire         2.265e+00  6.978e-01   3.245 0.001197 ** 
## State_Full2NJ-New Jersey            1.871e+00  6.945e-01   2.694 0.007140 ** 
## State_Full2NM-New Mexico           -2.450e+00  7.156e-01  -3.424 0.000633 ***
## State_Full2NV-Nevada               -5.725e-01  7.040e-01  -0.813 0.416221    
## State_Full2NY-New York             -2.234e+00  7.963e-01  -2.805 0.005089 ** 
## State_Full2OH-Ohio                 -1.097e+00  7.532e-01  -1.456 0.145501    
## State_Full2OK-Oklahoma             -1.934e+00  7.023e-01  -2.753 0.005965 ** 
## State_Full2OR-Oregon                3.188e+00  6.957e-01   4.582 4.96e-06 ***
## State_Full2PA-Pennsylvania          1.478e-01  7.215e-01   0.205 0.837668    
## State_Full2RI-Rhode Island          4.109e+00  6.978e-01   5.889 4.70e-09 ***
## State_Full2SC-South Carolina       -1.590e+00  6.839e-01  -2.326 0.020154 *  
## State_Full2SD-South Dakota         -1.503e+00  7.087e-01  -2.121 0.034061 *  
## State_Full2TN-Tennessee            -2.489e+00  7.100e-01  -3.506 0.000467 ***
## State_Full2TX-Texas                -4.918e+00  9.844e-01  -4.996 6.48e-07 ***
## State_Full2UT-Utah                 -5.550e-01  6.867e-01  -0.808 0.419081    
## State_Full2VA-Virginia              7.937e-02  7.182e-01   0.111 0.912020    
## State_Full2VT-Vermont               3.614e+00  7.147e-01   5.057 4.74e-07 ***
## State_Full2WA-Washington           -1.732e+00  7.168e-01  -2.416 0.015804 *  
## State_Full2WI-Wisconsin            -2.749e-01  6.821e-01  -0.403 0.686973    
## State_Full2WV-West Virginia         5.187e-01  6.935e-01   0.748 0.454664    
## State_Full2WY-Wyoming              -6.109e-01  7.210e-01  -0.847 0.396968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.812 on 1636 degrees of freedom
## Multiple R-squared:  0.7772, Adjusted R-squared:  0.7697 
## F-statistic: 103.8 on 55 and 1636 DF,  p-value: < 2.2e-16
<p>
    Adjusted $R^2$ = 0.7697
</p>

MLR Workshop Task: 5. e. Check for multicollinearity, then remove it.

< Detection of multicollinearity >

Some authors suggest a formal detection rule based on tolerance or the variance inflation factor (VIF). As a rule of thumb, a VIF of 5 to 10 or above indicates a multicollinearity problem.
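As a sketch of what VIF measures: regress one predictor on all the others and compute VIF_j = 1 / (1 - R_j^2). For models with factor variables, car::vif() reports a generalized VIF (GVIF) instead:

```r
# Hand-computed VIF for a numeric predictor (illustration only):
# regress Denominator on the other numeric predictor, Year.
r2 <- summary(lm(Denominator ~ Year, data = ASD_State_4_MLR))$r.squared
vif_denominator <- 1 / (1 - r2)
# Near 1 here; the larger GVIF reported below also conditions on the
# factor predictors (Source, State_Full2), which drive the inflation.
```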

# To use vif() function
if(!require(car)){install.packages("car")}
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library("car")
vif(fit_model)
##                 GVIF Df GVIF^(1/(2*Df))
## Denominator 7.737921  1        2.781712
## Year        1.146987  1        1.070975
## Source      2.502245  3        1.165167
## State_Full2 7.768262 50        1.020712

[ Tips ] We notice the (G)VIF values of Denominator and State_Full2 are high. Let's exclude them one at a time.
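The "one at a time" workflow can be sketched as: drop the predictor with the largest GVIF^(1/(2*Df)), refit, re-check, and stop once every value falls below the chosen threshold:

```r
library(car)  # vif()
v <- vif(fit_model)
# Rank predictors by scaled GVIF; the top row is the first candidate
# to drop before refitting and re-checking.
v[order(-v[, "GVIF^(1/(2*Df))"]), ]
```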

Retain State; remove Denominator; then re-build the model:

# To re-build the (US state-level) ASD Prevalence model:
# retain State_Full2 but exclude Denominator, whose GVIF is high:
fit_model_with_State = lm(Prevalence ~ . - Denominator, data = ASD_State_4_MLR) # Exclude a variable using "- variable"
#
summary(fit_model_with_State)
## 
## Call:
## lm(formula = Prevalence ~ . - Denominator, data = ASD_State_4_MLR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7769  -1.3853  -0.0184   1.2957  19.2545 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        -1.319e+03  3.157e+01 -41.774  < 2e-16 ***
## Year                                6.624e-01  1.572e-02  42.130  < 2e-16 ***
## Sourcemedi                         -4.503e+00  3.450e-01 -13.053  < 2e-16 ***
## Sourcensch                          6.149e+00  4.436e-01  13.861  < 2e-16 ***
## Sourcesped                         -5.685e+00  3.403e-01 -16.708  < 2e-16 ***
## State_Full2AL-Alabama               1.875e+00  6.901e-01   2.717 0.006662 ** 
## State_Full2AR-Arkansas             -4.970e-01  6.855e-01  -0.725 0.468547    
## State_Full2AZ-Arizona               8.608e-01  6.754e-01   1.275 0.202621    
## State_Full2CA-California           -2.119e-01  7.141e-01  -0.297 0.766731    
## State_Full2CO-Colorado             -2.086e+00  6.862e-01  -3.039 0.002410 ** 
## State_Full2CT-Connecticut           6.142e-01  7.031e-01   0.874 0.382493    
## State_Full2DC-District of Columbia -1.414e+00  7.567e-01  -1.869 0.061793 .  
## State_Full2DE-Delaware              5.281e-01  7.084e-01   0.746 0.456070    
## State_Full2FL-Florida              -1.004e+00  6.982e-01  -1.438 0.150554    
## State_Full2GA-Georgia               2.090e-02  6.794e-01   0.031 0.975460    
## State_Full2HI-Hawaii               -2.126e+00  7.141e-01  -2.977 0.002949 ** 
## State_Full2IA-Iowa                 -1.919e+00  7.084e-01  -2.709 0.006828 ** 
## State_Full2ID-Idaho                 2.595e+00  7.085e-01   3.663 0.000258 ***
## State_Full2IL-Illinois             -1.531e-01  7.084e-01  -0.216 0.828894    
## State_Full2IN-Indiana               2.102e+00  7.031e-01   2.990 0.002833 ** 
## State_Full2KS-Kansas               -5.955e-01  7.202e-01  -0.827 0.408469    
## State_Full2KY-Kentucky             -4.094e-01  7.084e-01  -0.578 0.563423    
## State_Full2LA-Louisiana            -2.034e+00  7.084e-01  -2.872 0.004134 ** 
## State_Full2MA-Massachusetts         1.337e+00  7.031e-01   1.901 0.057418 .  
## State_Full2MD-Maryland              8.308e-01  6.754e-01   1.230 0.218796    
## State_Full2ME-Maine                 6.458e+00  7.409e-01   8.716  < 2e-16 ***
## State_Full2MI-Michigan              1.516e+00  7.084e-01   2.139 0.032544 *  
## State_Full2MN-Minnesota             7.094e+00  6.980e-01  10.164  < 2e-16 ***
## State_Full2MO-Missouri              7.645e-01  6.785e-01   1.127 0.259995    
## State_Full2MS-Mississippi          -1.940e+00  7.204e-01  -2.692 0.007164 ** 
## State_Full2MT-Montana               1.455e+00  7.265e-01   2.002 0.045447 *  
## State_Full2NC-North Carolina        6.671e-01  6.785e-01   0.983 0.325658    
## State_Full2ND-North Dakota          6.543e-01  7.141e-01   0.916 0.359735    
## State_Full2NE-Nebraska             -1.744e+00  7.141e-01  -2.442 0.014701 *  
## State_Full2NH-New Hampshire         2.320e+00  7.031e-01   3.300 0.000987 ***
## State_Full2NJ-New Jersey            2.669e+00  6.819e-01   3.914 9.44e-05 ***
## State_Full2NM-New Mexico           -2.283e+00  7.204e-01  -3.169 0.001557 ** 
## State_Full2NV-Nevada               -3.875e-01  7.084e-01  -0.547 0.584423    
## State_Full2NY-New York             -2.707e-01  7.031e-01  -0.385 0.700290    
## State_Full2OH-Ohio                  2.075e-01  7.141e-01   0.291 0.771469    
## State_Full2OK-Oklahoma             -1.525e+00  7.031e-01  -2.169 0.030198 *  
## State_Full2OR-Oregon                3.511e+00  6.981e-01   5.030 5.46e-07 ***
## State_Full2PA-Pennsylvania          1.324e+00  6.891e-01   1.921 0.054844 .  
## State_Full2RI-Rhode Island          4.144e+00  7.031e-01   5.895 4.55e-09 ***
## State_Full2SC-South Carolina       -1.087e+00  6.819e-01  -1.594 0.111164    
## State_Full2SD-South Dakota         -1.525e+00  7.141e-01  -2.135 0.032901 *  
## State_Full2TN-Tennessee            -1.817e+00  7.031e-01  -2.584 0.009849 ** 
## State_Full2TX-Texas                -1.456e+00  7.204e-01  -2.022 0.043387 *  
## State_Full2UT-Utah                 -2.537e-01  6.894e-01  -0.368 0.712949    
## State_Full2VA-Virginia              8.313e-01  7.084e-01   1.173 0.240803    
## State_Full2VT-Vermont               3.604e+00  7.201e-01   5.005 6.18e-07 ***
## State_Full2WA-Washington           -1.016e+00  7.084e-01  -1.434 0.151848    
## State_Full2WI-Wisconsin             2.790e-01  6.786e-01   0.411 0.680989    
## State_Full2WV-West Virginia         6.708e-01  6.982e-01   0.961 0.336837    
## State_Full2WY-Wyoming              -6.188e-01  7.265e-01  -0.852 0.394499    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.834 on 1637 degrees of freedom
## Multiple R-squared:  0.7736, Adjusted R-squared:  0.7662 
## F-statistic: 103.6 on 54 and 1637 DF,  p-value: < 2.2e-16
vif(fit_model_with_State)
##                 GVIF Df GVIF^(1/(2*Df))
## Year        1.139247  1        1.067355
## Source      1.306350  3        1.045546
## State_Full2 1.149997 50        1.001399
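The GVIF output above can also be screened programmatically. A minimal sketch, assuming `vif()` from the `car` package (as used above) and the common rule of thumb GVIF^(1/(2·Df)) < √5, which is a convention rather than a hard rule:

```r
# Flag predictors whose Df-adjusted GVIF exceeds a common rule-of-thumb threshold
gvif_adj <- vif(fit_model_with_State)[, "GVIF^(1/(2*Df))"]
gvif_adj[gvif_adj > sqrt(5)]  # empty here: no multicollinearity concern
```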
<p>
    Adjusted $R^2$ = 0.7662
</p>

Retain Denominator and remove State, then rebuild the model:

# To build a (national-level) ASD Prevalence predictive model across all states:
# When the U.S. state name is unknown, we can also fit a model without the state name/code:
fit_model_with_Denominator = lm(Prevalence ~ . - State_Full2, data = ASD_State_4_MLR) # Exclude a variable using: "- variable"
#
summary(fit_model_with_Denominator)
## 
## Call:
## lm(formula = Prevalence ~ . - State_Full2, data = ASD_State_4_MLR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4476  -1.9332  -0.2786   1.2479  21.0130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.310e+03  3.819e+01 -34.306   <2e-16 ***
## Denominator -1.139e-07  1.057e-07  -1.078    0.281    
## Year         6.583e-01  1.902e-02  34.606   <2e-16 ***
## Sourcemedi  -4.550e+00  3.964e-01 -11.478   <2e-16 ***
## Sourcensch   6.699e+00  5.171e-01  12.956   <2e-16 ***
## Sourcesped  -5.611e+00  3.984e-01 -14.085   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.43 on 1686 degrees of freedom
## Multiple R-squared:  0.6585, Adjusted R-squared:  0.6575 
## F-statistic: 650.2 on 5 and 1686 DF,  p-value: < 2.2e-16
vif(fit_model_with_Denominator)
##                 GVIF Df GVIF^(1/(2*Df))
## Denominator 1.145505  1        1.070283
## Year        1.138674  1        1.067086
## Source      1.301973  3        1.044962
<p>
    Adjusted $R^2$ = 0.6575
</p>

MLR Workshop Task: 6. f. What does your final model look like?

  • During prediction, if the U.S. state name is known, fit_model_with_State is preferable because it has the higher \(R^2\) value.

  • During prediction, if the U.S. state name is NOT known, fit_model_with_Denominator can be adopted because it does not require the state name as an input for prediction.
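These two criteria can also be checked numerically. A minimal sketch, assuming both fitted objects (`fit_model_with_State`, `fit_model_with_Denominator`) from above are still in the workspace:

```r
# Side-by-side comparison of the two candidate models:
# higher adjusted R-squared and lower AIC favour fit_model_with_State here.
data.frame(
  model  = c("with_State", "with_Denominator"),
  adj_r2 = c(summary(fit_model_with_State)$adj.r.squared,
             summary(fit_model_with_Denominator)$adj.r.squared),
  AIC    = c(AIC(fit_model_with_State), AIC(fit_model_with_Denominator))
)
```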

MLR Prediction 1

Let’s use fit_model_with_State to predict the AZ-Arizona ASD prevalence for Year 2016, had ADDM conducted a survey that year.

newdata = ASD_State_4_MLR[1,] # Copy data structure
newdata$Prevalence = NA
newdata$Denominator = 50000
newdata$Year = 2016
newdata$Source = "addm"
#newdata$State_Full2 = "CA-California"
newdata$State_Full2 = "AZ-Arizona"

newdata
##   Denominator Prevalence Year Source State_Full2
## 1       50000         NA 2016   addm  AZ-Arizona
predict(fit_model_with_State, newdata, interval = "predict")
##        fit      lwr      upr
## 1 17.54292 11.88535 23.20049
#
cat("Predicted ASD Prevalence is", round(predict(fit_model_with_State, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence is 17.5 per 1,000 Children

MLR Prediction 2

Let’s use fit_model_with_Denominator to predict the national-level ASD prevalence for Year 2016, had ADDM conducted a survey that year.

predict(fit_model_with_Denominator, newdata, interval = "predict")
##        fit      lwr      upr
## 1 17.07629 10.30306 23.84952
#
cat("Predicted ASD Prevalence is", round(predict(fit_model_with_Denominator, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence is 17.1 per 1,000 Children

MLR Prediction 3

Let’s use fit_model to predict the FL-Florida state-level ASD prevalence for Year 2025, assuming SPED conducts a record review/survey of 2,600,000 children.

newdata = ASD_State_4_MLR[1,] # Copy data structure
newdata$Prevalence = NA
newdata$Denominator = 2600000
newdata$Year = 2025
newdata$Source = "sped"
newdata$State_Full2 = "FL-Florida"

newdata
##   Denominator Prevalence Year Source State_Full2
## 1     2600000         NA 2025   sped  FL-Florida
predict(fit_model, newdata, interval = "predict")
##        fit      lwr     upr
## 1 16.63773 11.00985 22.2656
#
cat("Predicted ASD Prevalence is", round(predict(fit_model, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence is 16.6 per 1,000 Children

Linear Model: Polynomial (Linear) Regression (PLR)

<h3>
Linear Model: Polynomial (Linear) Regression (PLR) - Workshop Task
</h3>

Workshop Task:

    1. Get the data.
    2. Discover and visualize the data to gain insights (are there missing values in the dataframe, and if so, how should they be handled?).
    3. Visualize data and trends.
    4. Compute correlation between variables and apply multiple regression.
    5. Multiple polynomial regression.

PLR Workshop Task: 1. a. Get the data.

Use Case Data: “../dataset/ADV_ASD_State_R.csv”

Read in CSV data, storing as R dataframe

# Read back in above saved file:
ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
##   State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1    AZ       45322        6.5      5.8      7.3 2000   addm
## 2    GA       43593        6.5      5.8      7.3 2000   addm
## 3    MD       21532        5.5      4.6      6.6 2000   addm
## 4    NJ       29714        9.9      8.9     11.1 2000   addm
## 5    SC       24535        6.3      5.4      7.4 2000   addm
## 6    WV       23065        4.5      3.7      5.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2 Numerator_ASD Numerator_NonASD  Proportion
## 1        AZ-Arizona           295            45027 0.006508980
## 2        GA-Georgia           283            43310 0.006491868
## 3       MD-Maryland           118            21414 0.005480215
## 4     NJ-New Jersey           294            29420 0.009894326
## 5 SC-South Carolina           155            24380 0.006317506
## 6  WV-West Virginia           104            22961 0.004508996
##   Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1                      5.798905                      7.303948             9.7
## 2                      5.769431                      7.302595            11.0
## 3                      4.557351                      6.583638             8.6
## 4                      8.814705                     11.102544            14.8
## 5                      5.381662                      7.411085             9.3
## 6                      3.703408                      5.483723             6.6
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1           8.5          11.1               3.2             2.5             4.0
## 2           9.7          12.4               2.0             1.5             2.7
## 3           7.1          10.6               2.2             1.5             2.7
## 4          13.0          16.8               4.3             3.3             5.5
## 5           7.8          11.2               3.3             2.4             4.5
## 6           5.2           8.2               2.4             1.6             3.5
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                           8.6                         7.5
## 2                           7.9                         6.7
## 3                           4.9                         3.8
## 4                          11.3                         9.5
## 5                           6.5                         5.2
## 6                           4.5                         3.7
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                         9.8                           7.3
## 2                         9.3                           5.3
## 3                         6.4                           6.1
## 4                        13.3                          10.6
## 5                         8.2                           5.8
## 6                         5.5                            NA
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                         4.4                        12.2                  NA
## 2                         4.4                         6.4                  NA
## 3                         4.7                         8.0                  NA
## 4                         8.5                        13.1                  NA
## 5                         4.5                         7.3                  NA
## 6                          NA                          NA                  NA
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1                NA                NA                                   NA
## 2                NA                NA                                   NA
## 3                NA                NA                                   NA
## 4                NA                NA                                   NA
## 5                NA                NA                                   NA
## 6                NA                NA                                   NA
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                                 NA                                 NA
## 2                                 NA                                 NA
## 3                                 NA                                 NA
## 4                                 NA                                 NA
## 5                                 NA                                 NA
## 6                                 NA                                 NA
##         State_Region Source_UC
## 1        D8 Mountain      ADDM
## 2  D5 South Atlantic      ADDM
## 3  D5 South Atlantic      ADDM
## 4 D2 Middle Atlantic      ADDM
## 5  D5 South Atlantic      ADDM
## 6  D5 South Atlantic      ADDM
##                                                  Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network              Low
##   Prevalence_Risk4 Year_Factor
## 1           Medium        2000
## 2           Medium        2000
## 3           Medium        2000
## 4           Medium        2000
## 5           Medium        2000
## 6              Low        2000
names(ASD_State)
##  [1] "State"                               
##  [2] "Denominator"                         
##  [3] "Prevalence"                          
##  [4] "Lower.CI"                            
##  [5] "Upper.CI"                            
##  [6] "Year"                                
##  [7] "Source"                              
##  [8] "Source_Full1"                        
##  [9] "State_Full1"                         
## [10] "State_Full2"                         
## [11] "Numerator_ASD"                       
## [12] "Numerator_NonASD"                    
## [13] "Proportion"                          
## [14] "Chi_Wilson_Corrected_Lower.CI"       
## [15] "Chi_Wilson_Corrected_Upper.CI"       
## [16] "Male.Prevalence"                     
## [17] "Male.Lower.CI"                       
## [18] "Male.Upper.CI"                       
## [19] "Female.Prevalence"                   
## [20] "Female.Lower.CI"                     
## [21] "Female.Upper.CI"                     
## [22] "Non.hispanic.white.Prevalence"       
## [23] "Non.hispanic.white.Lower.CI"         
## [24] "Non.hispanic.white.Upper.CI"         
## [25] "Non.hispanic.black.Prevalence"       
## [26] "Non.hispanic.black.Lower.CI"         
## [27] "Non.hispanic.black.Upper.CI"         
## [28] "Hispanic.Prevalence"                 
## [29] "Hispanic.Lower.CI"                   
## [30] "Hispanic.Upper.CI"                   
## [31] "Asian.or.Pacific.Islander.Prevalence"
## [32] "Asian.or.Pacific.Islander.Lower.CI"  
## [33] "Asian.or.Pacific.Islander.Upper.CI"  
## [34] "State_Region"                        
## [35] "Source_UC"                           
## [36] "Source_Full3"                        
## [37] "Prevalence_Risk2"                    
## [38] "Prevalence_Risk4"                    
## [39] "Year_Factor"
# Filter [ Source: SPED ] and a single state, keeping only two columns for PLR:
# Dependent variable: Prevalence
# Independent variable: Year
# ASD_State_4_PLR = subset(ASD_State, Source_UC == 'SPED' & State_Full2 == 'MN-Minnesota', select = c(Prevalence, Year))
# ASD_State_4_PLR = subset(ASD_State, Source_UC == 'SPED' & State_Full2 == 'MS-Mississippi', select = c(Prevalence, Year))
ASD_State_4_PLR = subset(ASD_State, Source_UC == 'SPED' & State_Full2 == 'FL-Florida', select = c(Prevalence, Year))
#
dim(ASD_State_4_PLR)
## [1] 17  2
head(ASD_State_4_PLR)
##      Prevalence Year
## 849         1.5 2000
## 900         1.8 2001
## 951         2.1 2002
## 1002        2.4 2003
## 1052        2.7 2004
## 1100        3.0 2005

PLR Workshop Task: 2. b. Discover and visualize the data to gain insights (are there missing values in the dataframe, and if so, how should they be handled?).

summary(ASD_State_4_PLR)
##    Prevalence          Year     
##  Min.   : 1.500   Min.   :2000  
##  1st Qu.: 2.700   1st Qu.:2004  
##  Median : 4.900   Median :2008  
##  Mean   : 5.694   Mean   :2008  
##  3rd Qu.: 8.300   3rd Qu.:2012  
##  Max.   :12.100   Max.   :2016
# Check whether each column has missing values:
lapply(ASD_State_4_PLR, function(col_x)sum(is.na(col_x)))
## $Prevalence
## [1] 0
## 
## $Year
## [1] 0
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
barplot(apply(ASD_State_4_PLR, 2, function(col_x)sum(is.na(col_x))))

No missing values
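None are needed here, but had missing values turned up, two common remedies could be sketched as follows (toy data, not the ASD dataset):

```r
# Toy dataframe with one missing Prevalence value
df <- data.frame(Prevalence = c(1.5, NA, 2.1), Year = c(2000, 2001, 2002))

# Option 1: drop incomplete rows
df_drop <- na.omit(df)

# Option 2: impute with the column mean
df_imp <- df
df_imp$Prevalence[is.na(df_imp$Prevalence)] <- mean(df$Prevalence, na.rm = TRUE)

sum(is.na(df_drop))  # 0
sum(is.na(df_imp))   # 0
```

Dropping rows is safest when few values are missing; mean imputation keeps the sample size but can flatten trends.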


PLR Workshop Task: 3. c. Visualize the data to gain insights

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
plot(ASD_State_4_PLR$Year, ASD_State_4_PLR$Prevalence)


PLR Workshop Task: 4. d. Compute correlation between variables and apply multiple regression.

Compute the correlation between the numeric variables:

# To use select_if() function
if(!require(dplyr)){install.packages("dplyr")}
library("dplyr")

summary(select_if(ASD_State_4_PLR, is.numeric))
##    Prevalence          Year     
##  Min.   : 1.500   Min.   :2000  
##  1st Qu.: 2.700   1st Qu.:2004  
##  Median : 4.900   Median :2008  
##  Mean   : 5.694   Mean   :2008  
##  3rd Qu.: 8.300   3rd Qu.:2012  
##  Max.   :12.100   Max.   :2016
correlation = cor(select_if(ASD_State_4_PLR, is.numeric))
correlation
##            Prevalence     Year
## Prevalence   1.000000 0.980119
## Year         0.980119 1.000000
# Each variable's correlation with the target (dependent) variable:
correlation[, 1]
## Prevalence       Year 
##   1.000000   0.980119
str(ASD_State_4_PLR)
## 'data.frame':    17 obs. of  2 variables:
##  $ Prevalence: num  1.5 1.8 2.1 2.4 2.7 3 3.5 4.1 4.9 5.7 ...
##  $ Year      : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
# SLR
fit_model_SLR = lm(Prevalence ~ Year , data = ASD_State_4_PLR)
#
summary(fit_model_SLR)
## 
## Call:
## lm(formula = Prevalence ~ Year, data = ASD_State_4_PLR)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9150 -0.6566 -0.1108  0.4809  1.2392 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1358.0726    71.2824  -19.05 6.37e-12 ***
## Year            0.6792     0.0355   19.13 6.00e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.717 on 15 degrees of freedom
## Multiple R-squared:  0.9606, Adjusted R-squared:  0.958 
## F-statistic:   366 on 1 and 15 DF,  p-value: 5.995e-12
plot(fit_model_SLR)

<p>
    Adjusted $R^2$ = 0.958
</p>
# PLR (quadratic)
fit_model_PLR = lm(Prevalence ~ Year + I(Year^2), data = ASD_State_4_PLR)
#
summary(fit_model_PLR)
## 
## Call:
## lm(formula = Prevalence ~ Year + I(Year^2), data = ASD_State_4_PLR)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26223 -0.05325  0.03671  0.11045  0.17918 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.231e+05  6.980e+03   17.64 5.87e-11 ***
## Year        -1.233e+02  6.953e+00  -17.73 5.45e-11 ***
## I(Year^2)    3.087e-02  1.731e-03   17.83 5.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1524 on 14 degrees of freedom
## Multiple R-squared:  0.9983, Adjusted R-squared:  0.9981 
## F-statistic:  4209 on 2 and 14 DF,  p-value: < 2.2e-16

About the I() function: https://stackoverflow.com/questions/8055508/in-r-formulas-why-do-i-have-to-use-the-i-function-on-power-terms-like-y-i
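In short, `^` inside a formula is interpreted as interaction expansion, so power terms must be protected with `I()`. An equivalent alternative is `poly()`; a minimal sketch on toy data:

```r
# Toy data following an exact quadratic trend
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + 0.1 * d$x^2

fit_I    <- lm(y ~ x + I(x^2), data = d)              # raw power term via I()
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE), data = d)  # same model via poly()

# Both parameterisations produce identical fitted values
all.equal(fitted(fit_I), fitted(fit_poly))  # TRUE
```

Without `raw = TRUE`, `poly()` uses orthogonal polynomials, which change the coefficients but not the fitted values.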

plot(fit_model_PLR)

<p>
    Adjusted $R^2$ = 0.9981
</p>

Visualize the difference between SLR & PLR:

plot(ASD_State_4_PLR$Year, ASD_State_4_PLR$Prevalence)
# add SLR line
lines(ASD_State_4_PLR$Year, predict(fit_model_SLR, ASD_State_4_PLR), col="blue", lwd=2) # or # abline(fit_model_SLR, col="blue", lwd=2)
# add PLR line
lines(ASD_State_4_PLR$Year, predict(fit_model_PLR, ASD_State_4_PLR), col="orange", lwd=2) 


PLR Prediction

newdata = ASD_State_4_PLR[1,] # Copy data structure
newdata$Prevalence = NA
newdata$Year = 2025

newdata
##     Prevalence Year
## 849         NA 2025
predict(fit_model_PLR, newdata, interval = "predict")
##          fit      lwr      upr
## 849 25.42036 24.34469 26.49602
#
cat("Predicted ASD Prevalence of Year [", newdata$Year, "] is", round(predict(fit_model_PLR, newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence of Year [ 2025 ] is 25.4 per 1,000 Children
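The same model can score several future years at once. A minimal sketch, assuming `fit_model_PLR` from above is still in the workspace (and noting that extrapolating further beyond 2016 carries increasing uncertainty):

```r
# Predict ASD prevalence for a range of future years, with prediction intervals
future <- data.frame(Year = 2017:2025)
round(cbind(future, predict(fit_model_PLR, future, interval = "predict")), 1)
```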

Multiple PLR Workshop Task: 5. e. Multiple polynomial regression (MPR): enhance MLR by adding higher-order transformed variables.

Reuse the MLR data (ASD_State_4_MLR) by copying it to a new dataframe: ASD_State_4_MPR

ASD_State_4_MPR = ASD_State_4_MLR

dim(ASD_State_4_MPR)
## [1] 1692    5
summary(ASD_State_4_MPR)
##   Denominator        Prevalence          Year       Source   
##  Min.   :    965   Min.   : 0.400   Min.   :2000   addm: 86  
##  1st Qu.: 107151   1st Qu.: 3.100   1st Qu.:2003   medi:655  
##  Median : 353328   Median : 5.600   Median :2007   nsch: 98  
##  Mean   : 604689   Mean   : 7.191   Mean   :2007   sped:853  
##  3rd Qu.: 767928   3rd Qu.: 9.200   3rd Qu.:2011             
##  Max.   :5824922   Max.   :42.700   Max.   :2016             
##                                                              
##             State_Full2  
##  AZ-Arizona       :  40  
##  MD-Maryland      :  40  
##  GA-Georgia       :  39  
##  MO-Missouri      :  39  
##  NC-North Carolina:  39  
##  WI-Wisconsin     :  39  
##  (Other)          :1456
# Check whether each column has missing values:
lapply(ASD_State_4_MLR, function(col_x)sum(is.na(col_x)))
## $Denominator
## [1] 0
## 
## $Prevalence
## [1] 0
## 
## $Year
## [1] 0
## 
## $Source
## [1] 0
## 
## $State_Full2
## [1] 0
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
barplot(apply(ASD_State_4_MLR, 2, function(col_x)sum(is.na(col_x))))

Build the multiple PLR model by adding I(Year^2) and I(log(Denominator)):

fit_model_MPR = lm(Prevalence ~ . + I(Year^2) + I(log(Denominator)), data = ASD_State_4_MPR) # "~." means all other variables, including factors
#
summary(fit_model_MPR)
## 
## Call:
## lm(formula = Prevalence ~ . + I(Year^2) + I(log(Denominator)), 
##     data = ASD_State_4_MPR)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.697 -1.238  0.007  1.157 19.672 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         7.908e+04  1.309e+04   6.044 1.86e-09 ***
## Denominator                         2.074e-06  2.340e-07   8.863  < 2e-16 ***
## Year                               -7.943e+01  1.304e+01  -6.093 1.38e-09 ***
## Sourcemedi                          8.189e-01  6.860e-01   1.194 0.232812    
## Sourcensch                          8.667e-02  7.138e-01   0.121 0.903371    
## Sourcesped                          2.740e-01  8.268e-01   0.331 0.740399    
## State_Full2AL-Alabama               4.765e+00  7.495e-01   6.357 2.66e-10 ***
## State_Full2AR-Arkansas              1.978e+00  7.178e-01   2.755 0.005930 ** 
## State_Full2AZ-Arizona               4.043e+00  7.703e-01   5.248 1.74e-07 ***
## State_Full2CA-California           -5.541e-01  1.239e+00  -0.447 0.654693    
## State_Full2CO-Colorado              4.215e-01  7.283e-01   0.579 0.562857    
## State_Full2CT-Connecticut           2.975e+00  7.250e-01   4.103 4.27e-05 ***
## State_Full2DC-District of Columbia -2.342e+00  7.256e-01  -3.227 0.001275 ** 
## State_Full2DE-Delaware              4.775e-01  6.739e-01   0.709 0.478711    
## State_Full2FL-Florida               1.942e+00  8.972e-01   2.165 0.030531 *  
## State_Full2GA-Georgia               3.500e+00  8.244e-01   4.245 2.31e-05 ***
## State_Full2HI-Hawaii               -1.547e+00  6.823e-01  -2.267 0.023498 *  
## State_Full2IA-Iowa                  1.358e-01  7.160e-01   0.190 0.849558    
## State_Full2ID-Idaho                 3.888e+00  6.892e-01   5.642 1.98e-08 ***
## State_Full2IL-Illinois              3.055e+00  8.797e-01   3.473 0.000528 ***
## State_Full2IN-Indiana               5.214e+00  7.852e-01   6.641 4.24e-11 ***
## State_Full2KS-Kansas                1.358e+00  7.212e-01   1.883 0.059905 .  
## State_Full2KY-Kentucky              2.358e+00  7.544e-01   3.126 0.001805 ** 
## State_Full2LA-Louisiana             9.990e-01  7.744e-01   1.290 0.197219    
## State_Full2MA-Massachusetts         4.323e+00  7.724e-01   5.597 2.56e-08 ***
## State_Full2MD-Maryland              3.662e+00  7.407e-01   4.944 8.45e-07 ***
## State_Full2ME-Maine                 7.323e+00  7.112e-01  10.297  < 2e-16 ***
## State_Full2MI-Michigan              4.812e+00  8.455e-01   5.691 1.49e-08 ***
## State_Full2MN-Minnesota             9.790e+00  7.477e-01  13.093  < 2e-16 ***
## State_Full2MO-Missouri              3.854e+00  7.598e-01   5.072 4.38e-07 ***
## State_Full2MS-Mississippi           5.804e-01  7.455e-01   0.779 0.436378    
## State_Full2MT-Montana               1.230e+00  6.913e-01   1.779 0.075351 .  
## State_Full2NC-North Carolina        3.919e+00  7.977e-01   4.913 9.84e-07 ***
## State_Full2ND-North Dakota         -4.673e-01  6.898e-01  -0.677 0.498273    
## State_Full2NE-Nebraska             -3.600e-01  6.961e-01  -0.517 0.605101    
## State_Full2NH-New Hampshire         2.854e+00  6.714e-01   4.250 2.25e-05 ***
## State_Full2NJ-New Jersey            5.703e+00  7.771e-01   7.338 3.39e-13 ***
## State_Full2NM-New Mexico           -2.369e-01  7.227e-01  -0.328 0.743118    
## State_Full2NV-Nevada                1.166e+00  6.982e-01   1.670 0.095033 .  
## State_Full2NY-New York              2.614e+00  9.153e-01   2.856 0.004349 ** 
## State_Full2OH-Ohio                  3.553e+00  8.682e-01   4.092 4.49e-05 ***
## State_Full2OK-Oklahoma              1.227e+00  7.474e-01   1.642 0.100745    
## State_Full2OR-Oregon                5.859e+00  7.202e-01   8.135 8.06e-16 ***
## State_Full2PA-Pennsylvania          4.490e+00  8.246e-01   5.445 5.96e-08 ***
## State_Full2RI-Rhode Island          4.587e+00  6.704e-01   6.841 1.10e-11 ***
## State_Full2SC-South Carolina        1.791e+00  7.436e-01   2.408 0.016138 *  
## State_Full2SD-South Dakota         -1.629e+00  6.795e-01  -2.398 0.016616 *  
## State_Full2TN-Tennessee             1.344e+00  7.893e-01   1.703 0.088708 .  
## State_Full2TX-Texas                -1.007e-03  1.078e+00  -0.001 0.999254    
## State_Full2UT-Utah                  1.500e+00  6.915e-01   2.169 0.030212 *  
## State_Full2VA-Virginia              3.846e+00  7.919e-01   4.856 1.31e-06 ***
## State_Full2VT-Vermont               3.076e+00  6.870e-01   4.478 8.04e-06 ***
## State_Full2WA-Washington            2.227e+00  7.979e-01   2.791 0.005318 ** 
## State_Full2WI-Wisconsin             3.271e+00  7.469e-01   4.379 1.27e-05 ***
## State_Full2WV-West Virginia         2.188e+00  6.877e-01   3.182 0.001492 ** 
## State_Full2WY-Wyoming              -1.521e+00  6.981e-01  -2.178 0.029521 *  
## I(Year^2)                           1.995e-02  3.247e-03   6.145 1.01e-09 ***
## I(log(Denominator))                -2.265e+00  2.335e-01  -9.701  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 1634 degrees of freedom
## Multiple R-squared:  0.7956, Adjusted R-squared:  0.7884 
## F-statistic: 111.6 on 57 and 1634 DF,  p-value: < 2.2e-16
<p>
    Adjusted $R^2$ = 0.7884
</p>
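Because the MLR model with State is nested inside this multiple-polynomial model (same predictors plus two extra terms, fitted on the same rows), an F-test can check whether the added terms improve the fit significantly. A sketch, assuming `fit_model_with_State` from the MLR section is still in the workspace:

```r
# Nested-model F-test: do I(Year^2) and I(log(Denominator)) add explanatory power?
anova(fit_model_with_State, fit_model_MPR)
```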

Multiple PLR Prediction

# Look up the actual 2016 FL-Florida SPED record for reference:
newdata = subset(ASD_State_4_MPR, ASD_State_4_MPR$Year == 2016 &
                            ASD_State_4_MPR$State_Full2 == 'FL-Florida' & 
                            ASD_State_4_MPR$Source == 'sped')
newdata
##      Denominator Prevalence Year Source State_Full2
## 1652     2555399       12.1 2016   sped  FL-Florida
newdata = ASD_State_4_MPR[1,] # Copy data structure
newdata$Prevalence = NA
newdata$Denominator = 2600000
newdata$Year = 2025
newdata$Source = "sped"
newdata$State_Full2 = "FL-Florida"

newdata
##   Denominator Prevalence Year Source State_Full2
## 1     2600000         NA 2025   sped  FL-Florida
predict(fit_model_MPR, newdata, interval = "predict")
##        fit      lwr      upr
## 1 22.70557 17.01847 28.39267
#
cat("Predicted ASD Prevalence of Year [", newdata$Year, "] is", round(predict(fit_model_MPR,newdata), 1), "per 1,000 Children")
## Predicted ASD Prevalence of Year [ 2025 ] is 22.7 per 1,000 Children
<h3>
    Quiz:
</h3>
<p>
    Compare the predicted ASD prevalence and model $R^2$ between Multiple Linear Regression and Multiple Polynomial Regression.
</p>
<p>
    Which prediction result would you use? Provide your justification.
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.


Linear Model: Logistic Regression (LR)

<h3>
Linear Model: Logistic Regression (LR) - Workshop Task
</h3>

Workshop Task:

    1. Get the data.
    2. Logistic regression - Binary Class.
    3. Logistic regression - Multi-Class.
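As a warm-up before the tasks: binary logistic regression in R is fitted with `glm(..., family = binomial)`. A minimal, self-contained sketch on synthetic data (the names `score` and `risk` are illustrative, not from the ASD dataset):

```r
set.seed(42)
# Synthetic binary-classification data: the chance of "High" risk rises with score
n     <- 200
score <- runif(n, 0, 15)
risk  <- factor(rbinom(n, 1, plogis(-4 + 0.6 * score)),
                labels = c("Low", "High"))

# Fit the logistic model; the coefficient on score is on the log-odds scale
fit_lr <- glm(risk ~ score, family = binomial)
summary(fit_lr)$coefficients

# Predicted probability of "High" risk at a score of 10
predict(fit_lr, data.frame(score = 10), type = "response")
```

The same `glm()` call pattern carries over to the workshop tasks once the real response variable (e.g. Prevalence_Risk2) is substituted in.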

LR Workshop Task: 1. a. Get the data.

Use Case Data: “../dataset/ADV_ASD_State_R.csv”

Read in CSV data, storing as R dataframe

# Read back in above saved file:
# ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
# ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
head(ASD_State)
##   State Denominator Prevalence Lower.CI Upper.CI Year Source
## 1    AZ       45322        6.5      5.8      7.3 2000   addm
## 2    GA       43593        6.5      5.8      7.3 2000   addm
## 3    MD       21532        5.5      4.6      6.6 2000   addm
## 4    NJ       29714        9.9      8.9     11.1 2000   addm
## 5    SC       24535        6.3      5.4      7.4 2000   addm
## 6    WV       23065        4.5      3.7      5.5 2000   addm
##                                             Source_Full1    State_Full1
## 1 Autism & Developmental Disabilities Monitoring Network        Arizona
## 2 Autism & Developmental Disabilities Monitoring Network        Georgia
## 3 Autism & Developmental Disabilities Monitoring Network       Maryland
## 4 Autism & Developmental Disabilities Monitoring Network     New Jersey
## 5 Autism & Developmental Disabilities Monitoring Network South Carolina
## 6 Autism & Developmental Disabilities Monitoring Network  West Virginia
##         State_Full2 Numerator_ASD Numerator_NonASD  Proportion
## 1        AZ-Arizona           295            45027 0.006508980
## 2        GA-Georgia           283            43310 0.006491868
## 3       MD-Maryland           118            21414 0.005480215
## 4     NJ-New Jersey           294            29420 0.009894326
## 5 SC-South Carolina           155            24380 0.006317506
## 6  WV-West Virginia           104            22961 0.004508996
##   Chi_Wilson_Corrected_Lower.CI Chi_Wilson_Corrected_Upper.CI Male.Prevalence
## 1                      5.798905                      7.303948             9.7
## 2                      5.769431                      7.302595            11.0
## 3                      4.557351                      6.583638             8.6
## 4                      8.814705                     11.102544            14.8
## 5                      5.381662                      7.411085             9.3
## 6                      3.703408                      5.483723             6.6
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1           8.5          11.1               3.2             2.5             4.0
## 2           9.7          12.4               2.0             1.5             2.7
## 3           7.1          10.6               2.2             1.5             2.7
## 4          13.0          16.8               4.3             3.3             5.5
## 5           7.8          11.2               3.3             2.4             4.5
## 6           5.2           8.2               2.4             1.6             3.5
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                           8.6                         7.5
## 2                           7.9                         6.7
## 3                           4.9                         3.8
## 4                          11.3                         9.5
## 5                           6.5                         5.2
## 6                           4.5                         3.7
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                         9.8                           7.3
## 2                         9.3                           5.3
## 3                         6.4                           6.1
## 4                        13.3                          10.6
## 5                         8.2                           5.8
## 6                         5.5                            NA
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                         4.4                        12.2                  NA
## 2                         4.4                         6.4                  NA
## 3                         4.7                         8.0                  NA
## 4                         8.5                        13.1                  NA
## 5                         4.5                         7.3                  NA
## 6                          NA                          NA                  NA
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1                NA                NA                                   NA
## 2                NA                NA                                   NA
## 3                NA                NA                                   NA
## 4                NA                NA                                   NA
## 5                NA                NA                                   NA
## 6                NA                NA                                   NA
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                                 NA                                 NA
## 2                                 NA                                 NA
## 3                                 NA                                 NA
## 4                                 NA                                 NA
## 5                                 NA                                 NA
## 6                                 NA                                 NA
##         State_Region Source_UC
## 1        D8 Mountain      ADDM
## 2  D5 South Atlantic      ADDM
## 3  D5 South Atlantic      ADDM
## 4 D2 Middle Atlantic      ADDM
## 5  D5 South Atlantic      ADDM
## 6  D5 South Atlantic      ADDM
##                                                  Source_Full3 Prevalence_Risk2
## 1 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 2 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 3 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 4 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 5 ADDM Autism & Developmental Disabilities Monitoring Network             High
## 6 ADDM Autism & Developmental Disabilities Monitoring Network              Low
##   Prevalence_Risk4 Year_Factor
## 1           Medium        2000
## 2           Medium        2000
## 3           Medium        2000
## 4           Medium        2000
## 5           Medium        2000
## 6              Low        2000
Column_Names = c("Prevalence_Risk2", "Denominator", "Year", "Source", "State_Full2")
ASD_State_4_LR_Risk2 <- ASD_State[ , (names(ASD_State) %in% Column_Names)]
dim(ASD_State_4_LR_Risk2)
## [1] 1692    5
head(ASD_State_4_LR_Risk2)
##   Denominator Year Source       State_Full2 Prevalence_Risk2
## 1       45322 2000   addm        AZ-Arizona             High
## 2       43593 2000   addm        GA-Georgia             High
## 3       21532 2000   addm       MD-Maryland             High
## 4       29714 2000   addm     NJ-New Jersey             High
## 5       24535 2000   addm SC-South Carolina             High
## 6       23065 2000   addm  WV-West Virginia              Low
Column_Names = c("Prevalence_Risk4", "Denominator", "Year", "Source", "State_Full2")
ASD_State_4_LR_Risk4 <- ASD_State[ , (names(ASD_State) %in% Column_Names)]
dim(ASD_State_4_LR_Risk4)
## [1] 1692    5
head(ASD_State_4_LR_Risk4)
##   Denominator Year Source       State_Full2 Prevalence_Risk4
## 1       45322 2000   addm        AZ-Arizona           Medium
## 2       43593 2000   addm        GA-Georgia           Medium
## 3       21532 2000   addm       MD-Maryland           Medium
## 4       29714 2000   addm     NJ-New Jersey           Medium
## 5       24535 2000   addm SC-South Carolina           Medium
## 6       23065 2000   addm  WV-West Virginia              Low

LR Workshop Task 2b: Logistic Regression (LR), Binary Class. (Reuse the multiple polynomial model with a categorical dependent variable.)

table(ASD_State_4_LR_Risk2$Prevalence_Risk2)
## 
##  Low High 
##  740  952
barplot(table(ASD_State_4_LR_Risk2$Prevalence_Risk2))

counts = table(ASD_State_4_LR_Risk2$Prevalence_Risk2, ASD_State_4_LR_Risk2$Source)
counts
##       
##        addm medi nsch sped
##   Low     5  344    0  391
##   High   81  311   98  462
barplot(counts,
        main="Prevalence by Data Sources and Risk Levels",
        xlab="Data Sources",
        ylab="Occurrences",
        col=c("white", "lightgrey"),
        legend = rownames(counts), 
        args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 4))

str(ASD_State_4_LR_Risk2)
## 'data.frame':    1692 obs. of  5 variables:
##  $ Denominator     : int  45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
##  $ Year            : int  2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
##  $ Source          : Factor w/ 4 levels "addm","medi",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ State_Full2     : Factor w/ 51 levels "AK-Alaska","AL-Alabama",..: 4 11 21 32 41 50 2 4 3 6 ...
##  $ Prevalence_Risk2: Ord.factor w/ 2 levels "Low"<"High": 2 2 2 2 2 1 1 2 2 2 ...

Build model

# Binary Classification:
fit_model_LR_Risk2 = glm(Prevalence_Risk2 ~ Denominator + Year + Source + State_Full2 + I(Year^2) + I(log(Denominator)), 
                         family=binomial(link='logit'), data = ASD_State_4_LR_Risk2)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#
summary(fit_model_LR_Risk2)
## 
## Call:
## glm(formula = Prevalence_Risk2 ~ Denominator + Year + Source + 
##     State_Full2 + I(Year^2) + I(log(Denominator)), family = binomial(link = "logit"), 
##     data = ASD_State_4_LR_Risk2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1715  -0.2765   0.0036   0.2843   2.9153  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         3.555e+04  2.844e+04   1.250 0.211249    
## Denominator                         3.855e-06  5.131e-07   7.514 5.75e-14 ***
## Year                               -3.614e+01  2.834e+01  -1.275 0.202284    
## Sourcemedi                          1.381e+00  1.290e+00   1.071 0.284340    
## Sourcensch                          6.822e+00  4.151e+02   0.016 0.986887    
## Sourcesped                          1.242e+00  1.610e+00   0.771 0.440412    
## State_Full2AL-Alabama               2.623e+00  1.182e+00   2.219 0.026454 *  
## State_Full2AR-Arkansas              3.721e+00  1.108e+00   3.359 0.000781 ***
## State_Full2AZ-Arizona               4.067e+00  1.303e+00   3.122 0.001795 ** 
## State_Full2CA-California           -7.922e+00  2.916e+00  -2.717 0.006595 ** 
## State_Full2CO-Colorado              6.498e-01  1.124e+00   0.578 0.563137    
## State_Full2CT-Connecticut           2.939e+00  1.068e+00   2.752 0.005925 ** 
## State_Full2DC-District of Columbia -2.947e+00  9.614e-01  -3.066 0.002171 ** 
## State_Full2DE-Delaware             -1.158e+00  8.549e-01  -1.355 0.175515    
## State_Full2FL-Florida              -5.360e-02  1.672e+00  -0.032 0.974422    
## State_Full2GA-Georgia               3.631e+00  1.426e+00   2.547 0.010879 *  
## State_Full2HI-Hawaii               -9.970e-01  8.746e-01  -1.140 0.254317    
## State_Full2IA-Iowa                 -7.461e-01  1.050e+00  -0.710 0.477531    
## State_Full2ID-Idaho                 3.614e+00  9.177e-01   3.938 8.21e-05 ***
## State_Full2IL-Illinois              2.320e+00  1.604e+00   1.446 0.148090    
## State_Full2IN-Indiana               5.539e+00  1.302e+00   4.253 2.11e-05 ***
## State_Full2KS-Kansas                2.352e+00  9.880e-01   2.381 0.017270 *  
## State_Full2KY-Kentucky              2.421e+00  1.167e+00   2.075 0.037999 *  
## State_Full2LA-Louisiana            -1.964e-01  1.277e+00  -0.154 0.877789    
## State_Full2MA-Massachusetts         3.239e+00  1.237e+00   2.619 0.008814 ** 
## State_Full2MD-Maryland              5.419e+00  1.230e+00   4.405 1.06e-05 ***
## State_Full2ME-Maine                 4.860e+00  1.002e+00   4.852 1.22e-06 ***
## State_Full2MI-Michigan              6.864e+00  1.528e+00   4.493 7.03e-06 ***
## State_Full2MN-Minnesota             8.257e+00  1.399e+00   5.904 3.55e-09 ***
## State_Full2MO-Missouri              4.564e+00  1.282e+00   3.561 0.000370 ***
## State_Full2MS-Mississippi           1.114e-01  1.109e+00   0.100 0.919981    
## State_Full2MT-Montana              -9.584e-01  8.950e-01  -1.071 0.284242    
## State_Full2NC-North Carolina        4.455e+00  1.411e+00   3.157 0.001595 ** 
## State_Full2ND-North Dakota         -5.683e-01  9.133e-01  -0.622 0.533771    
## State_Full2NE-Nebraska             -8.605e-01  9.379e-01  -0.917 0.358910    
## State_Full2NH-New Hampshire         2.232e+00  8.838e-01   2.526 0.011541 *  
## State_Full2NJ-New Jersey            4.045e+00  1.301e+00   3.109 0.001880 ** 
## State_Full2NM-New Mexico           -7.265e-01  1.058e+00  -0.687 0.492243    
## State_Full2NV-Nevada                6.972e-02  9.263e-01   0.075 0.940005    
## State_Full2NY-New York              1.409e+00  1.736e+00   0.812 0.417034    
## State_Full2OH-Ohio                  3.032e+00  1.527e+00   1.985 0.047092 *  
## State_Full2OK-Oklahoma              1.284e+00  1.171e+00   1.097 0.272786    
## State_Full2OR-Oregon                7.916e+00  1.312e+00   6.033 1.61e-09 ***
## State_Full2PA-Pennsylvania          3.412e+00  1.485e+00   2.297 0.021595 *  
## State_Full2RI-Rhode Island          3.524e+00  9.128e-01   3.861 0.000113 ***
## State_Full2SC-South Carolina        2.740e+00  1.172e+00   2.339 0.019357 *  
## State_Full2SD-South Dakota         -3.510e+00  9.095e-01  -3.859 0.000114 ***
## State_Full2TN-Tennessee             1.575e+00  1.310e+00   1.202 0.229375    
## State_Full2TX-Texas                -4.908e+00  2.341e+00  -2.097 0.035995 *  
## State_Full2UT-Utah                  1.130e+00  9.597e-01   1.177 0.239101    
## State_Full2VA-Virginia              3.173e+00  1.278e+00   2.482 0.013077 *  
## State_Full2VT-Vermont               1.972e+00  9.016e-01   2.188 0.028703 *  
## State_Full2WA-Washington            1.927e+00  1.312e+00   1.469 0.141841    
## State_Full2WI-Wisconsin             6.754e+00  1.253e+00   5.391 7.02e-08 ***
## State_Full2WV-West Virginia         2.887e+00  9.484e-01   3.044 0.002334 ** 
## State_Full2WY-Wyoming              -1.637e+00  9.034e-01  -1.813 0.069900 .  
## I(Year^2)                           9.189e-03  7.062e-03   1.301 0.193207    
## I(log(Denominator))                -2.975e+00  5.033e-01  -5.912 3.39e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2319.0  on 1691  degrees of freedom
## Residual deviance:  840.9  on 1634  degrees of freedom
## AIC: 956.9
## 
## Number of Fisher Scoring iterations: 17
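The fitted model can also be turned into class predictions. Below is a minimal, self-contained sketch on simulated data (the names `df`, `probs`, and `labels` are illustrative, not from the workshop data): for a binary factor response, `predict(..., type = "response")` returns the probability of the second factor level, which is then thresholded at 0.5.

```r
set.seed(7)
df <- data.frame(x = rnorm(100))
df$y <- factor(ifelse(runif(100) < plogis(df$x), "High", "Low"),
               levels = c("Low", "High"))     # "Low" first, "High" second, as in Prevalence_Risk2
fit <- glm(y ~ x, family = binomial(link = "logit"), data = df)

probs  <- predict(fit, type = "response")     # P(y == "High"), the second factor level
labels <- ifelse(probs > 0.5, "High", "Low")  # hard class label at a 0.5 cutoff
table(Predicted = labels, Actual = df$y)
```

The same pattern applies to `fit_model_LR_Risk2`; the 0.5 cutoff is a common default, not a tuned choice.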

Evaluate model

# Likelihood ratio test: significance of the difference between the full model and the null model.
pchisq(fit_model_LR_Risk2$null.deviance - fit_model_LR_Risk2$deviance, 
       fit_model_LR_Risk2$df.null - fit_model_LR_Risk2$df.residual, lower.tail = FALSE)
## [1] 1.533547e-271

Check whether the value above is very small (the smaller, the more significant), e.g. less than 0.05.

< How to perform a Logistic Regression in R > Michy Alice

https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
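On a toy simulated dataset, the same likelihood-ratio computation looks like this (the names `lrt_stat`, `lrt_df`, and `p_value` are illustrative):

```r
set.seed(42)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 1.2 * x))    # true effect present, so the test should reject
fit <- glm(y ~ x, family = binomial(link = "logit"))

lrt_stat <- fit$null.deviance - fit$deviance  # drop in deviance versus the null model
lrt_df   <- fit$df.null - fit$df.residual     # parameters added over the null model
p_value  <- pchisq(lrt_stat, lrt_df, lower.tail = FALSE)
p_value < 0.05                                # TRUE here: the predictor is significant
```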

# Analysis of deviance: the drop from the null deviance to the residual deviance, term by term
anova(fit_model_LR_Risk2, test="Chisq")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Prevalence_Risk2
## 
## Terms added sequentially (first to last)
## 
## 
##                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                 1691    2318.98              
## Denominator          1     3.54      1690    2315.44   0.06000 .  
## Year                 1   799.97      1689    1515.47 < 2.2e-16 ***
## Source               3   130.38      1686    1385.09 < 2.2e-16 ***
## State_Full2         50   503.15      1636     881.94 < 2.2e-16 ***
## I(Year^2)            1     3.12      1635     878.81   0.07711 .  
## I(log(Denominator))  1    37.91      1634     840.90 7.391e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
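The `anova()` table above adds terms sequentially (first to last), so the reported deviances depend on term order. A hedged alternative, sketched on toy data, is `drop1()`, which tests each term against the full model instead (the names `x1` and `x2` are illustrative):

```r
set.seed(11)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- rbinom(200, 1, plogis(x1))     # only x1 truly drives the outcome
fit <- glm(y ~ x1 + x2, family = binomial)
drop1(fit, test = "Chisq")           # one likelihood-ratio test per term, full model as baseline
```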
# R^2 equivalent
if(!require(pscl)){install.packages("pscl")}
## Loading required package: pscl
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library("pscl")

While no exact equivalent to the \(R^2\) of linear regression exists, the McFadden \(R^2\) index can be used to assess the model fit.

# R^2 equivalent
pR2(fit_model_LR_Risk2)[4]
##  McFadden 
## 0.6373833
<p>
    McFadden \(R^2\) = 0.6374
</p>
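The McFadden index itself is just one minus the ratio of the residual deviance to the null deviance, so it can be computed without `pscl`. A self-contained sketch on simulated data:

```r
set.seed(1)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

# Deviance is -2 * log-likelihood, so this equals 1 - logLik(full)/logLik(null),
# the same value pR2(fit)[4] reports.
mcfadden <- 1 - fit$deviance / fit$null.deviance
mcfadden                             # between 0 and 1; higher indicates better fit
```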

LR Workshop Task 3c: Logistic Regression (LR), Multi-Class. (Reuse the multiple polynomial model with a categorical dependent variable.)

table(ASD_State_4_LR_Risk4$Prevalence_Risk4)
## 
##       Low    Medium      High Very High 
##       740       584       294        74
barplot(table(ASD_State_4_LR_Risk4$Prevalence_Risk4))

counts = table(ASD_State_4_LR_Risk4$Prevalence_Risk4, ASD_State_4_LR_Risk4$Source)
counts
##            
##             addm medi nsch sped
##   Low          5  344    0  391
##   Medium      36  225    5  318
##   High        38   74   38  144
##   Very High    7   12   55    0
barplot(counts,
        main="Prevalence Occurrence by Source and Risk",
        xlab="Data Sources",
        ylab="Occurrences",
        col=c("lightyellow", "orange", "red","darkred"),
        legend = rownames(counts), 
        args.legend = list(x = "topleft", bty = "n", cex = 0.85, y.intersp = 4))

Build model

# multinom function from the nnet package 
if(!require(nnet)){install.packages("nnet")}
## Loading required package: nnet
library("nnet")
# Multi-Class Classification:
fit_model_LR_Risk4 = multinom(Prevalence_Risk4 ~ Denominator + Year + Source + State_Full2 + I(Year^2) + I(log(Denominator)), 
                     data = ASD_State_4_LR_Risk4, maxit=1000) # maxit https://cran.r-project.org/web/packages/nnet/nnet.pdf
## # weights:  236 (174 variable)
## initial  value 2345.610059 
## iter  10 value 1642.176333
## iter  20 value 1287.779160
## iter  30 value 1107.430631
## iter  40 value 921.386084
## iter  50 value 890.347795
## iter  60 value 812.517810
## iter  70 value 808.708251
## iter  80 value 790.145541
## iter  90 value 789.356983
## iter  90 value 789.356983
## iter 100 value 788.697764
## iter 110 value 788.263896
## iter 120 value 787.147810
## iter 120 value 787.147808
## iter 130 value 785.704645
## iter 130 value 785.704644
## iter 140 value 785.135020
## iter 140 value 785.135020
## iter 150 value 784.905675
## iter 150 value 784.905675
## iter 160 value 784.812617
## iter 160 value 784.812617
## iter 170 value 784.774742
## iter 170 value 784.774742
## iter 180 value 784.759309
## iter 180 value 784.759309
## iter 190 value 784.753016
## iter 190 value 784.753016
## iter 200 value 784.750451
## iter 200 value 784.750451
## iter 210 value 784.749404
## iter 210 value 784.749404
## iter 220 value 784.748978
## iter 220 value 784.748978
## iter 230 value 784.748804
## iter 230 value 784.748804
## final  value 784.748760 
## converged
summary(fit_model_LR_Risk4)
## Call:
## multinom(formula = Prevalence_Risk4 ~ Denominator + Year + Source + 
##     State_Full2 + I(Year^2) + I(log(Denominator)), data = ASD_State_4_LR_Risk4, 
##     maxit = 1000)
## 
## Coefficients:
##             (Intercept)  Denominator       Year Sourcemedi Sourcensch
## Medium    -0.0004007338 2.315392e-06 -0.6177611 -0.3983914   1.288252
## High      -0.0012785664 4.217311e-06 -1.1823721 -2.5260908   0.260585
## Very High -0.0011016480 4.541402e-06 -1.6324251  2.4329304  -3.579447
##           Sourcesped State_Full2AL-Alabama State_Full2AR-Arkansas
## Medium    -0.5741604              0.259656             1.55153213
## High      -4.1602188              1.793057            -0.04348038
## Very High -6.7861902              6.666128            -2.13524747
##           State_Full2AZ-Arizona State_Full2CA-California State_Full2CO-Colorado
## Medium                1.6167595                -4.909023             -0.9979121
## High                  2.5982487                -9.553742             -2.3874541
## Very High             0.5183466                -8.552350             -6.6330509
##           State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium                    0.6888067                          -3.360576
## High                      2.8704430                          -3.307447
## Very High                 0.9797351                          -5.173241
##           State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium                 -2.053038             -1.298922             1.3332883
## High                   -2.057146             -2.591858             1.0150253
## Very High              -2.146310             -4.002177             0.7796848
##           State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium               -1.906098          -2.279007            1.770407
## High                 -3.940876          -2.758616            3.491865
## Very High            -7.794271          -4.414376            6.716912
##           State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium                 0.4698186              2.900020            0.6011796
## High                  -0.7865485              5.099685           -1.7097399
## Very High              1.9329550              5.232929           -4.0113460
##           State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium                 0.2443813               -1.898521
## High                   0.1226346               -3.786383
## Very High              1.4148624               -5.282611
##           State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium                      0.6339059               2.920329
## High                        3.1676739               4.244674
## Very High                   3.9373067               3.480534
##           State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium               2.486395               4.418793                5.227256
## High                 8.020376               5.139910               11.144972
## Very High           11.401158               3.322766               11.611459
##           State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium                  1.972230                 -1.556161
## High                    3.289085                 -5.205541
## Very High               3.521192                 -1.477672
##           State_Full2MT-Montana State_Full2NC-North Carolina
## Medium                -1.877082                    2.0851871
## High                  -0.197909                    2.4901862
## Very High             -1.285098                    0.5156278
##           State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium                    -1.0216012              -2.189739
## High                       0.3656576              -2.024072
## Very High                 -4.8018843              -4.879143
##           State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium                      0.7622948                 1.584463
## High                        3.7620099                 3.174298
## Very High                   1.4202079                 7.779043
##           State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium                   -2.207321           -1.4251488           -0.003734705
## High                     -5.688137           -0.9769902           -1.027852534
## Very High                -2.098040           -1.6479051           -2.507489677
##           State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium             0.8580716             -0.6403649             5.188302
## High               1.4962143             -2.0139406             8.107611
## Very High          2.9609860             -4.0221542             8.045266
##           State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium                      1.091974                   1.824172
## High                        2.409973                   5.986608
## Very High                   3.450578                   4.974181
##           State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium                       0.6450362                  -3.953263
## High                        -0.5909618                  -6.017699
## Very High                   -2.4215065                  -7.979805
##           State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium                 -0.3863012           -3.989071         -0.4175548
## High                   -2.5265653          -10.100360         -1.6451886
## Very High              -4.1210715           -1.627015         -2.3027877
##           State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium                 0.7890874              0.881496               -0.2522830
## High                   2.7623520              3.880883               -0.4083035
## Very High              0.9597204              3.867580               -0.1751737
##           State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium                   4.249644                   0.9415586
## High                     4.899023                   2.0562665
## Very High                3.003220                   2.3448054
##           State_Full2WY-Wyoming    I(Year^2) I(log(Denominator))
## Medium                -1.916496 0.0003134994           -1.885972
## High                  -3.657686 0.0005966326           -2.642810
## Very High             -6.460238 0.0008259125           -5.130174
## 
## Std. Errors:
##            (Intercept)  Denominator         Year   Sourcemedi   Sourcensch
## Medium    6.504869e-15 1.016564e-07 1.302425e-11 1.114663e-12 2.335311e-13
## High      1.237153e-14 1.417345e-07 2.095135e-11 2.338936e-12 2.691731e-12
## Very High 1.556105e-14 1.663810e-06 3.208907e-11 3.298394e-12 2.925267e-12
##             Sourcesped State_Full2AL-Alabama State_Full2AR-Arkansas
## Medium    6.032300e-13          4.611155e-13           2.418736e-14
## High      4.842465e-13          5.054817e-13           4.399004e-14
## Very High 6.850639e-14          1.070782e-12           8.549246e-14
##           State_Full2AZ-Arizona State_Full2CA-California State_Full2CO-Colorado
## Medium             4.381850e-14             6.176288e-14           2.215907e-14
## High               3.545111e-14             1.570842e-13           4.494078e-15
## Very High          8.425035e-14             6.526380e-14           3.901093e-14
##           State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium                 3.297721e-14                       1.882122e-14
## High                   3.305750e-14                       4.937714e-14
## Very High              8.701018e-14                       1.871889e-15
##           State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium              2.759187e-15          3.693894e-14          3.524273e-14
## High                1.693835e-14          9.245737e-14          1.473422e-14
## Very High           4.743799e-14          7.911506e-14          8.362122e-15
##           State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium            1.813085e-14       1.733870e-15        1.204846e-13
## High              3.650909e-14       6.787245e-14        1.457238e-13
## Very High         4.725432e-14       8.730409e-14        2.455899e-13
##           State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium              1.520165e-14          2.951863e-14         5.667162e-15
## High                1.545507e-14          5.134649e-14         1.026072e-14
## Very High           8.946492e-15          8.284287e-14         2.487088e-14
##           State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium              8.399369e-15            1.042326e-14
## High                4.569650e-15            5.474020e-14
## Very High           5.752079e-15            6.911962e-14
##           State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium                   1.645795e-14           5.008280e-14
## High                     3.631590e-14           1.108605e-13
## Very High                1.206869e-14           1.689210e-13
##           State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium           1.553973e-14           2.953357e-14            4.294752e-14
## High             1.011270e-13           7.570489e-14            8.167658e-15
## Very High        6.169331e-14           8.877702e-14            4.157086e-14
##           State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium              4.204133e-14              2.101698e-14
## High                3.369476e-14              1.239228e-15
## Very High           8.024045e-14              1.186427e-14
##           State_Full2MT-Montana State_Full2NC-North Carolina
## Medium             4.020754e-14                 4.737792e-14
## High               1.932927e-14                 4.809309e-14
## Very High          9.331166e-14                 8.437890e-14
##           State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium                  4.733390e-14           6.068673e-15
## High                    4.818736e-14           1.702983e-14
## Very High               3.160265e-14           1.743880e-14
##           State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium                   4.550843e-14             4.730878e-13
## High                     3.099444e-15             5.256319e-13
## Very High                7.715746e-14             1.038871e-12
##           State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium                2.228400e-14         7.112411e-15           4.630727e-14
## High                  1.536900e-15         3.405749e-14           9.901473e-14
## Very High             1.163183e-14         9.852721e-16           8.721041e-14
##           State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium          1.598397e-14           1.239635e-14         4.717532e-14
## High            1.401742e-14           5.308381e-14         2.557711e-15
## Very High       2.210444e-15           7.628757e-14         5.286462e-14
##           State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium                  4.030365e-14               4.488620e-14
## High                    1.752958e-14               2.402991e-14
## Very High               3.561026e-14               5.043292e-14
##           State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium                    1.557550e-14               9.401830e-15
## High                      4.641252e-14               1.658577e-14
## Very High                 7.568193e-14               3.571812e-14
##           State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium               2.154977e-14        1.406481e-14       2.396506e-14
## High                 5.974267e-14        3.518306e-14       9.415083e-14
## Very High            7.992758e-14        1.125899e-13       1.401813e-13
##           State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium              2.544850e-14          4.620749e-14             2.766294e-14
## High                5.860562e-14          2.523099e-14             6.774583e-14
## Very High           8.674593e-14          9.937702e-14             9.407003e-14
##           State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium               5.186528e-14                5.540629e-15
## High                 4.334490e-14                3.743788e-14
## Very High            1.007639e-13                1.454464e-14
##           State_Full2WY-Wyoming    I(Year^2) I(log(Denominator))
## Medium             6.722288e-15 2.610422e-08        1.928104e-12
## High               1.023602e-14 3.820346e-08        1.410502e-11
## Very High          2.508895e-14 6.797764e-08        1.786641e-11
## 
## Residual Deviance: 1569.498 
## AIC: 1917.498
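As with the binary model, a `multinom` fit can produce per-class probabilities and hard class labels. A minimal sketch on simulated data (the names `cls`, `probs`, and `pred` are illustrative):

```r
library(nnet)

set.seed(3)
df <- data.frame(x = rnorm(300))
df$cls <- factor(sample(c("Low", "Medium", "High"), 300, replace = TRUE),
                 levels = c("Low", "Medium", "High"))
fit <- multinom(cls ~ x, data = df, trace = FALSE)

probs <- predict(fit, type = "probs")   # one column per class; each row sums to 1
pred  <- predict(fit, type = "class")   # most probable class per observation
head(probs)
```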

< MULTINOMIAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES >

https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/

## extract the coefficients from the model and exponentiate
exp(coef(fit_model_LR_Risk4))
##           (Intercept) Denominator      Year  Sourcemedi Sourcensch  Sourcesped
## Medium      0.9995993    1.000002 0.5391502  0.67139922 3.62644214 0.563177522
## High        0.9987223    1.000004 0.3065507  0.07997104 1.29768902 0.015604143
## Very High   0.9988990    1.000005 0.1954550 11.39221703 0.02789111 0.001129263
##           State_Full2AL-Alabama State_Full2AR-Arkansas State_Full2AZ-Arizona
## Medium                 1.296484              4.7186943              5.036742
## High                   6.007789              0.9574513             13.440179
## Very High            785.348612              0.1182153              1.679249
##           State_Full2CA-California State_Full2CO-Colorado
## Medium                7.379693e-03            0.368648343
## High                  7.093535e-05            0.091863261
## Very High             1.930908e-04            0.001316142
##           State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium                     1.991338                        0.034715260
## High                      17.644834                        0.036609512
## Very High                  2.663751                        0.005666173
##           State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium                 0.1283444            0.27282579              3.793497
## High                   0.1278183            0.07488077              2.759433
## Very High              0.1169147            0.01827580              2.180785
##           State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium             0.148659345         0.10238578            5.873246
## High               0.019431181         0.06337940           32.847135
## Very High          0.000412089         0.01210211          826.261954
##           State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium                 1.5997040               18.1745            1.8242694
## High                   0.4554139              163.9703            0.1809128
## Very High              6.9098986              187.3408            0.0181090
##           State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium                  1.276831             0.149790014
## High                    1.130471             0.022677484
## Very High               4.115920             0.005079153
##           State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium                       1.884959               18.54739
## High                        23.752169               69.73300
## Very High                   51.280300               32.47705
##           State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium               12.01788               82.99607                186.2809
## High               3042.32231              170.70035              69214.9296
## Very High         89425.22954               27.73696             110355.1501
##           State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium                  7.186686               0.210944310
## High                   26.818327               0.005486084
## Very High              33.824707               0.228168268
##           State_Full2MT-Montana State_Full2NC-North Carolina
## Medium                0.1530360                     8.046097
## High                  0.8204445                    12.063522
## Very High             0.2766236                     1.674689
##           State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium                   0.360018027            0.111945929
## High                     1.441461648            0.132116387
## Very High                0.008214255            0.007603527
##           State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium                       2.143189                 4.876673
## High                        43.034836                23.910026
## Very High                    4.137981              2389.985984
##           State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium                 0.109994977            0.2404727             0.99627226
## High                   0.003385895            0.3764424             0.35777444
## Very High              0.122696725            0.1924526             0.08147251
##           State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium              2.358608             0.52710006              179.164
## High                4.464755             0.13346171             3319.637
## Very High          19.317009             0.01791433             3118.994
##           State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium                       2.98015                   6.197663
## High                        11.13366                 398.062173
## Very High                   31.51861                 144.630278
##           State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium                      1.90605600               0.0191919800
## High                        0.55379441               0.0024352668
## Very High                   0.08878776               0.0003423061
##           State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium                 0.67956583        1.851691e-02         0.65865543
## High                   0.07993310        4.106477e-05         0.19297616
## Very High              0.01622712        1.965154e-01         0.09997974
##           State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium                  2.201387              2.414509                0.7770248
## High                   15.837048             48.466974                0.6647771
## Very High               2.610966             47.826518                0.8393112
##           State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium                   70.08045                    2.563975
## High                    134.15860                    7.816731
## Very High                20.15031                   10.431243
##           State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium              0.147121636  1.000314          0.15168154
## High                0.025792134  1.000597          0.07116099
## Very High           0.001564424  1.000826          0.00591553
# Uncomment below to display all odds ratios:
# paste(exp(coef(fit_model_LR_Risk4)))
# Test the significance/importance of each coefficient (check if p values < 0.05)

# z score for coefficients
z <- summary(fit_model_LR_Risk4)$coefficients/summary(fit_model_LR_Risk4)$standard.errors
cat('\n< Table of coefficient z scores>')
## 
## < Table of coefficient z scores>
z
##             (Intercept) Denominator         Year    Sourcemedi    Sourcensch
## Medium     -61605204301    22.77665 -47431585790 -3.574096e+11  5.516405e+12
## High      -103347504827    29.75500 -56434165146 -1.080017e+12  9.680944e+10
## Very High  -70795238650     2.72952 -50871679968  7.376106e+11 -1.223631e+12
##              Sourcesped State_Full2AL-Alabama State_Full2AR-Arkansas
## Medium    -9.518100e+11          5.631040e+11           6.414641e+13
## High      -8.591119e+12          3.547224e+12          -9.884141e+11
## Very High -9.905923e+13          6.225476e+12          -2.497586e+13
##           State_Full2AZ-Arizona State_Full2CA-California State_Full2CO-Colorado
## Medium             3.689674e+13            -7.948178e+13          -4.503402e+13
## High               7.329104e+13            -6.081924e+13          -5.312445e+14
## Very High          6.152456e+12            -1.310428e+14          -1.700306e+14
##           State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium                 2.088735e+13                      -1.785525e+14
## High                   8.683181e+13                      -6.698337e+13
## Very High              1.126001e+13                      -2.763647e+15
##           State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium             -7.440734e+14         -3.516403e+13          3.783159e+13
## High               -1.214490e+14         -2.803301e+13          6.888899e+13
## Very High          -4.524454e+13         -5.058680e+13          9.324007e+13
##           State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium           -1.051301e+14      -1.314405e+15        1.469405e+13
## High             -1.079423e+14      -4.064413e+13        2.396221e+13
## Very High        -1.649430e+14      -5.056322e+13        2.735012e+13
##           State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium              3.090576e+13          9.824372e+13         1.060812e+14
## High               -5.089257e+13          9.931906e+13        -1.666297e+14
## Very High           2.160573e+14          6.316692e+13        -1.612869e+14
##           State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium              2.909520e+13           -1.821427e+14
## High                2.683677e+13           -6.917006e+13
## Very High           2.459741e+14           -7.642708e+13
##           State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium                   3.851670e+13           5.831002e+13
## High                     8.722554e+13           3.828842e+13
## Very High                3.262415e+14           2.060451e+13
##           State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium           1.600025e+14           1.496193e+14            1.217126e+14
## High             7.930991e+13           6.789402e+13            1.364525e+15
## Very High        1.848038e+14           3.742822e+13            2.793172e+14
##           State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium              4.691169e+13             -7.404304e+13
## High                9.761416e+13             -4.200632e+15
## Very High           4.388300e+13             -1.245481e+14
##           State_Full2MT-Montana State_Full2NC-North Carolina
## Medium            -4.668484e+13                 4.401179e+13
## High              -1.023882e+13                 5.177846e+13
## Very High         -1.377210e+13                 6.110861e+12
##           State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium                 -2.158286e+13          -3.608267e+14
## High                    7.588249e+12          -1.188545e+14
## Very High              -1.519456e+14          -2.797867e+14
##           State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium                   1.675063e+13             3.349195e+12
## High                     1.213769e+15             6.039013e+12
## Very High                1.840662e+13             7.487980e+12
##           State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium               -9.905406e+13        -2.003749e+14          -8.065052e+10
## High                 -3.701046e+15        -2.868650e+13          -1.038080e+13
## Very High            -1.803705e+14        -1.672538e+15          -2.875218e+13
##           State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium          5.368327e+13          -5.165753e+13         1.099791e+14
## High            1.067396e+14          -3.793888e+13         3.169869e+15
## Very High       1.339543e+15          -5.272359e+13         1.521862e+14
##           State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium                  2.709366e+13               4.063994e+13
## High                    1.374803e+14               2.491315e+14
## Very High               9.689842e+13               9.862965e+13
##           State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium                    4.141352e+13              -4.204780e+14
## High                     -1.273281e+13              -3.628230e+14
## Very High                -3.199583e+13              -2.234106e+14
##           State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium              -1.792600e+13       -2.836206e+14      -1.742348e+13
## High                -4.229080e+13       -2.870802e+14      -1.747397e+13
## Very High           -5.156007e+13       -1.445080e+13      -1.642721e+13
##           State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium              3.100723e+13          1.907691e+13            -9.119890e+12
## High                4.713459e+13          1.538141e+14            -6.026990e+12
## Very High           1.106358e+13          3.891825e+13            -1.862163e+12
##           State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium               8.193619e+13                1.699371e+14
## High                 1.130242e+14                5.492476e+13
## Very High            2.980451e+13                1.612144e+14
##           State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium            -2.850957e+14  12009.53       -978148399524
## High              -3.573347e+14  15617.24       -187366682293
## Very High         -2.574934e+14  12149.77       -287140725316
# p value of 2-tailed z test
p <- (1 - pnorm(abs(z), 0, 1)) * 2
cat('\n< Table of coefficient p values>')
## 
## < Table of coefficient p values>
p
##           (Intercept) Denominator Year Sourcemedi Sourcensch Sourcesped
## Medium              0 0.000000000    0          0          0          0
## High                0 0.000000000    0          0          0          0
## Very High           0 0.006342669    0          0          0          0
##           State_Full2AL-Alabama State_Full2AR-Arkansas State_Full2AZ-Arizona
## Medium                        0                      0                     0
## High                          0                      0                     0
## Very High                     0                      0                     0
##           State_Full2CA-California State_Full2CO-Colorado
## Medium                           0                      0
## High                             0                      0
## Very High                        0                      0
##           State_Full2CT-Connecticut State_Full2DC-District of Columbia
## Medium                            0                                  0
## High                              0                                  0
## Very High                         0                                  0
##           State_Full2DE-Delaware State_Full2FL-Florida State_Full2GA-Georgia
## Medium                         0                     0                     0
## High                           0                     0                     0
## Very High                      0                     0                     0
##           State_Full2HI-Hawaii State_Full2IA-Iowa State_Full2ID-Idaho
## Medium                       0                  0                   0
## High                         0                  0                   0
## Very High                    0                  0                   0
##           State_Full2IL-Illinois State_Full2IN-Indiana State_Full2KS-Kansas
## Medium                         0                     0                    0
## High                           0                     0                    0
## Very High                      0                     0                    0
##           State_Full2KY-Kentucky State_Full2LA-Louisiana
## Medium                         0                       0
## High                           0                       0
## Very High                      0                       0
##           State_Full2MA-Massachusetts State_Full2MD-Maryland
## Medium                              0                      0
## High                                0                      0
## Very High                           0                      0
##           State_Full2ME-Maine State_Full2MI-Michigan State_Full2MN-Minnesota
## Medium                      0                      0                       0
## High                        0                      0                       0
## Very High                   0                      0                       0
##           State_Full2MO-Missouri State_Full2MS-Mississippi
## Medium                         0                         0
## High                           0                         0
## Very High                      0                         0
##           State_Full2MT-Montana State_Full2NC-North Carolina
## Medium                        0                            0
## High                          0                            0
## Very High                     0                            0
##           State_Full2ND-North Dakota State_Full2NE-Nebraska
## Medium                             0                      0
## High                               0                      0
## Very High                          0                      0
##           State_Full2NH-New Hampshire State_Full2NJ-New Jersey
## Medium                              0                        0
## High                                0                        0
## Very High                           0                        0
##           State_Full2NM-New Mexico State_Full2NV-Nevada State_Full2NY-New York
## Medium                           0                    0                      0
## High                             0                    0                      0
## Very High                        0                    0                      0
##           State_Full2OH-Ohio State_Full2OK-Oklahoma State_Full2OR-Oregon
## Medium                     0                      0                    0
## High                       0                      0                    0
## Very High                  0                      0                    0
##           State_Full2PA-Pennsylvania State_Full2RI-Rhode Island
## Medium                             0                          0
## High                               0                          0
## Very High                          0                          0
##           State_Full2SC-South Carolina State_Full2SD-South Dakota
## Medium                               0                          0
## High                                 0                          0
## Very High                            0                          0
##           State_Full2TN-Tennessee State_Full2TX-Texas State_Full2UT-Utah
## Medium                          0                   0                  0
## High                            0                   0                  0
## Very High                       0                   0                  0
##           State_Full2VA-Virginia State_Full2VT-Vermont State_Full2WA-Washington
## Medium                         0                     0                        0
## High                           0                     0                        0
## Very High                      0                     0                        0
##           State_Full2WI-Wisconsin State_Full2WV-West Virginia
## Medium                          0                           0
## High                            0                           0
## Very High                       0                           0
##           State_Full2WY-Wyoming I(Year^2) I(log(Denominator))
## Medium                        0         0                   0
## High                          0         0                   0
## Very High                     0         0                   0
# Uncomment below to display all p values:
# paste(p)

While no exact equivalent to the \(R^2\) of linear regression exists, the McFadden \(R^2\) index can be used to assess the model fit.

# R^2 equivalent: McFadden pseudo-R^2 via pscl::pR2()
pR2(fit_model_LR_Risk4)[4]
## fitting null model for pseudo-r2
## # weights:  8 (3 variable)
## initial  value 2345.610059 
## iter  10 value 1979.347988
## final  value 1979.347206 
## converged
##  McFadden 
## 0.6035315
<p>
    McFadden $R^2$ = 0.6035
</p>
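The McFadden index reported by `pR2()` can also be computed by hand as \(1 - \log L(\text{model}) / \log L(\text{null})\). A minimal sketch, using the built-in `mtcars` data as a stand-in for the workshop dataset:

```r
# McFadden pseudo-R^2 by hand: 1 - logLik(model)/logLik(null model).
# Illustrative only -- mtcars stands in for the workshop ASD data.
fit  <- glm(am ~ mpg + wt, family = binomial, data = mtcars)
null <- glm(am ~ 1,        family = binomial, data = mtcars)
mcfadden <- as.numeric(1 - logLik(fit) / logLik(null))
mcfadden  # values closer to 1 indicate a much better fit than the null model
```

This is the same quantity `pR2()` labels `McFadden` in the output above.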

Linear Model: Model Evaluation

<h3>
Linear Model: Model Evaluation - Workshop Task
</h3>

Workshop Task:

    1. Train/Test Dataset Split
    2. Confusion Matrix & Accuracy for Classification
    3. K-Fold Cross Validation
    4. \(R^2\), MSE, RMSE for Regression

Model Evaluation Workshop Task: 1. Train/Test Dataset Split

if(!require(caTools)){install.packages("caTools")}
## Loading required package: caTools
library("caTools") 
# Set a seed so the random sampling below is reproducible.
set.seed(88)

# Stratified Random Sampling: split the dataset into two sets in a predefined proportion (SplitRatio)
# while preserving the class ratios of the dependent variable (e.g. the proportion of Low/High).
split <- sample.split(ASD_State_4_LR_Risk2$Prevalence_Risk2, SplitRatio = 0.7)
# Get training and test data
trainset <- subset(ASD_State_4_LR_Risk2, split == TRUE) 
testset <- subset(ASD_State_4_LR_Risk2, split == FALSE)
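The idea behind stratified sampling can be sketched with base R alone: sample 70% within each class, so train and test keep the same class mix by construction. (A synthetic label vector stands in for `Prevalence_Risk2` here.)

```r
# Stratified sampling sketched with base R on a synthetic label vector.
set.seed(88)
y   <- factor(rep(c("Low", "High"), times = c(600, 400)))
# Sample 70% of the indices within each class separately:
idx <- unlist(lapply(split(seq_along(y), y), function(i) sample(i, 0.7 * length(i))))
prop.table(table(y))        # overall class mix
prop.table(table(y[idx]))   # training class mix -- identical by construction
```

`sample.split()` from caTools does this bookkeeping for you and returns a logical mask instead of indices.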

Build a binary classification model to predict (categorical) Prevalence Risk Level using Logistic Regression (LR)

# Binary Classification:
fit_model_LR_Risk2 = glm(Prevalence_Risk2 ~ Denominator + Year + Source + State_Full2 + I(Year^2) + I(log(Denominator)), 
                         family=binomial(link='logit'), data = trainset) # data = trainset
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit_model_LR_Risk2)
## 
## Call:
## glm(formula = Prevalence_Risk2 ~ Denominator + Year + Source + 
##     State_Full2 + I(Year^2) + I(log(Denominator)), family = binomial(link = "logit"), 
##     data = trainset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2887  -0.2840   0.0000   0.2573   2.9945  
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         5.763e+04  3.538e+04   1.629 0.103311    
## Denominator                         4.162e-06  6.157e-07   6.760 1.38e-11 ***
## Year                               -5.815e+01  3.526e+01  -1.649 0.099171 .  
## Sourcemedi                          1.118e+00  1.445e+00   0.774 0.439156    
## Sourcensch                          1.054e+01  7.542e+02   0.014 0.988849    
## Sourcesped                          7.207e-01  1.804e+00   0.399 0.689548    
## State_Full2AL-Alabama               1.431e+00  1.343e+00   1.066 0.286634    
## State_Full2AR-Arkansas              3.125e+00  1.264e+00   2.472 0.013421 *  
## State_Full2AZ-Arizona               3.938e+00  1.491e+00   2.641 0.008263 ** 
## State_Full2CA-California           -1.017e+01  3.302e+00  -3.079 0.002076 ** 
## State_Full2CO-Colorado              3.551e-01  1.310e+00   0.271 0.786395    
## State_Full2CT-Connecticut           2.313e+00  1.167e+00   1.982 0.047432 *  
## State_Full2DC-District of Columbia -3.179e+00  1.050e+00  -3.029 0.002457 ** 
## State_Full2DE-Delaware             -1.295e+00  1.093e+00  -1.185 0.236160    
## State_Full2FL-Florida              -1.461e+00  1.841e+00  -0.794 0.427311    
## State_Full2GA-Georgia               2.816e+00  1.582e+00   1.780 0.075089 .  
## State_Full2HI-Hawaii               -5.153e-01  9.723e-01  -0.530 0.596133    
## State_Full2IA-Iowa                 -1.417e+00  1.175e+00  -1.205 0.228072    
## State_Full2ID-Idaho                 3.280e+00  1.034e+00   3.171 0.001519 ** 
## State_Full2IL-Illinois              1.257e+00  1.800e+00   0.698 0.484964    
## State_Full2IN-Indiana               5.164e+00  1.490e+00   3.465 0.000530 ***
## State_Full2KS-Kansas                1.823e+00  1.128e+00   1.616 0.106019    
## State_Full2KY-Kentucky              2.163e+00  1.366e+00   1.583 0.113318    
## State_Full2LA-Louisiana            -8.178e-01  1.387e+00  -0.589 0.555579    
## State_Full2MA-Massachusetts         2.647e+00  1.391e+00   1.904 0.056939 .  
## State_Full2MD-Maryland              5.012e+00  1.410e+00   3.556 0.000377 ***
## State_Full2ME-Maine                 4.508e+00  1.062e+00   4.246 2.18e-05 ***
## State_Full2MI-Michigan              7.193e+00  1.754e+00   4.100 4.14e-05 ***
## State_Full2MN-Minnesota             7.660e+00  1.628e+00   4.705 2.53e-06 ***
## State_Full2MO-Missouri              3.821e+00  1.449e+00   2.637 0.008354 ** 
## State_Full2MS-Mississippi          -2.675e-01  1.284e+00  -0.208 0.834937    
## State_Full2MT-Montana              -5.011e-01  9.449e-01  -0.530 0.595874    
## State_Full2NC-North Carolina        3.973e+00  1.597e+00   2.488 0.012848 *  
## State_Full2ND-North Dakota         -4.695e-01  9.938e-01  -0.472 0.636644    
## State_Full2NE-Nebraska             -6.207e-01  1.069e+00  -0.581 0.561543    
## State_Full2NH-New Hampshire         2.147e+00  9.642e-01   2.227 0.025944 *  
## State_Full2NJ-New Jersey            3.282e+00  1.422e+00   2.307 0.021054 *  
## State_Full2NM-New Mexico           -1.324e+00  1.217e+00  -1.088 0.276734    
## State_Full2NV-Nevada               -6.188e-01  1.042e+00  -0.594 0.552734    
## State_Full2NY-New York             -9.957e-02  1.915e+00  -0.052 0.958533    
## State_Full2OH-Ohio                  1.624e+00  1.714e+00   0.947 0.343477    
## State_Full2OK-Oklahoma              6.620e-01  1.293e+00   0.512 0.608721    
## State_Full2OR-Oregon                2.351e+01  1.511e+03   0.016 0.987583    
## State_Full2PA-Pennsylvania          2.252e+00  1.628e+00   1.383 0.166624    
## State_Full2RI-Rhode Island          3.716e+00  1.116e+00   3.331 0.000866 ***
## State_Full2SC-South Carolina        2.326e+00  1.282e+00   1.815 0.069583 .  
## State_Full2SD-South Dakota         -3.282e+00  9.832e-01  -3.338 0.000843 ***
## State_Full2TN-Tennessee             4.190e-01  1.487e+00   0.282 0.778165    
## State_Full2TX-Texas                -7.006e+00  2.619e+00  -2.675 0.007466 ** 
## State_Full2UT-Utah                  9.209e-01  1.077e+00   0.855 0.392485    
## State_Full2VA-Virginia              2.424e+00  1.475e+00   1.643 0.100321    
## State_Full2VT-Vermont               2.200e+00  1.064e+00   2.067 0.038706 *  
## State_Full2WA-Washington            2.264e+00  1.465e+00   1.545 0.122328    
## State_Full2WI-Wisconsin             6.659e+00  1.414e+00   4.709 2.49e-06 ***
## State_Full2WV-West Virginia         2.207e+00  1.063e+00   2.075 0.037965 *  
## State_Full2WY-Wyoming              -1.696e+00  1.050e+00  -1.614 0.106509    
## I(Year^2)                           1.467e-02  8.788e-03   1.670 0.094995 .  
## I(log(Denominator))                -2.728e+00  5.762e-01  -4.734 2.20e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1622.82  on 1183  degrees of freedom
## Residual deviance:  579.43  on 1126  degrees of freedom
## AIC: 695.43
## 
## Number of Fisher Scoring iterations: 18

Model Evaluation Workshop Task: 2. Confusion Matrix & Accuracy for Classification

https://en.wikipedia.org/wiki/Confusion_matrix

# Confusion matrix on Trainset
probTrainset <- predict(fit_model_LR_Risk2, type = 'response')
# One way is to use the proportion of High risk in the *Training* data.
threshold2 <- sum(trainset$Prevalence_Risk2 == "High")/length(trainset$Prevalence_Risk2)
cat('Trainset High Risk Threshold = ', threshold2)
## Trainset High Risk Threshold =  0.5625
# If logistic regression probability > threshold, predict High, else predict Low.
predictTrainset <- ifelse(probTrainset > threshold2, "High", "Low")
# Create a contingency table (Confusion Matrix) with actuals on rows and predictions on columns.
table(trainset$Prevalence_Risk2, predictTrainset)
##       predictTrainset
##        High Low
##   Low    46 472
##   High  593  73
# Accuracy on Trainset
AccuracyTrain <- mean(predictTrainset == trainset$Prevalence_Risk2) 
cat('Trainset Accuracy = ', AccuracyTrain)
## Trainset Accuracy =  0.8994932
# Confusion matrix on Testset
probTestset <- predict(fit_model_LR_Risk2, newdata = testset, type = 'response') # newdata = testset
# One way is to use the proportion of High risk in the *Training* data.
cat('Reused Trainset High Risk Threshold = ', threshold2)
## Reused Trainset High Risk Threshold =  0.5625
# If logistic regression probability > threshold, predict High, else predict Low.
predictTestset <- ifelse(probTestset > threshold2, "High", "Low")
# Create a contingency table (Confusion Matrix) with actuals on rows and predictions on columns.
table(testset$Prevalence_Risk2, predictTestset)
##       predictTestset
##        High Low
##   Low    30 192
##   High  258  28
# Accuracy on Testset
AccuracyTest <- mean(predictTestset == testset$Prevalence_Risk2) 
cat('Testset Accuracy = ', AccuracyTest)
## Testset Accuracy =  0.8858268
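Accuracy is only one summary of a confusion matrix. A small sketch, using the test-set counts printed above (with "High" treated as the positive class), derives sensitivity and specificity as well:

```r
# Metrics derived from the test-set confusion matrix above.
# Counts copied from the printed table; "High" is the positive class.
TP <- 258; FN <- 28   # actual High: predicted High / predicted Low
TN <- 192; FP <- 30   # actual Low:  predicted Low  / predicted High
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate (recall)
specificity <- TN / (TN + FP)   # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

The accuracy matches the Testset Accuracy printed above (0.8858).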

Model Evaluation Workshop Task: 3. K-Fold Cross Validation

R caret package

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

  • data splitting
  • pre-processing
  • feature selection
  • model tuning using resampling
  • variable importance estimation

http://topepo.github.io/caret/index.html

if(!require(caret)){install.packages("caret")}
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
library("caret")
if(!require(e1071)){install.packages("e1071")}
## Loading required package: e1071
library("e1071")
# Caret train/test split method:
set.seed(88)
caret_idx = createDataPartition(ASD_State_4_LR_Risk2$Prevalence_Risk2, p = 0.85, list = FALSE)
caret_trainset = ASD_State_4_LR_Risk2[caret_idx, ]
caret_testset  = ASD_State_4_LR_Risk2[-caret_idx, ]

#caret_threshold <- sum(caret_trainset$Prevalence_Risk2 == "Low")/length(caret_trainset$Prevalence_Risk2)
#caret_threshold
# Caret logistic regression
set.seed(88)
cv_control = trainControl(method = "cv", number = 5)

caret_model_LR_Risk2 = train(form = Prevalence_Risk2 ~ ., data = caret_trainset, 
                             trControl = cv_control, method = "glm", family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#
caret_model_LR_Risk2
## Generalized Linear Model 
## 
## 1439 samples
##    4 predictors
##    2 classes: 'Low', 'High' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1151, 1151, 1151, 1151, 1152 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8638018  0.7237422
# Get predicted class label
caret_model_LR_Risk2_Pred <- predict(caret_model_LR_Risk2, caret_testset)

# Uncomment below to get class probability
# caret_model_LR_Risk2_Pred <- predict(caret_model_LR_Risk2, caret_testset, type = "prob")
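What `trainControl(method = "cv", number = 5)` does under the hood can be sketched in a few lines of base R: assign each row to one of 5 folds, then train on 4 folds and score on the held-out fold, rotating. (Again `mtcars` stands in for the workshop data here.)

```r
# Manual 5-fold cross validation, sketched on built-in data.
set.seed(88)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold assignment
acc <- sapply(1:k, function(f) {
  train <- mtcars[folds != f, ]                       # 4 folds for training
  test  <- mtcars[folds == f, ]                       # held-out fold for scoring
  m     <- glm(am ~ mpg + wt, family = binomial, data = train)
  pred  <- ifelse(predict(m, test, type = "response") > 0.5, 1, 0)
  mean(pred == test$am)
})
mean(acc)   # average held-out accuracy across the 5 folds
```

caret reports exactly this kind of resampled average (plus Kappa) in the `Resampling results` shown above.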

https://topepo.github.io/caret/using-your-own-model-in-train.html#Illustration5

# CM & Acc
cm_table <- table(as.factor(caret_model_LR_Risk2_Pred), caret_testset$Prevalence_Risk2)
confusionMatrix(cm_table)
## Confusion Matrix and Statistics
## 
##       
##        Low High
##   Low   92   11
##   High  19  131
##                                           
##                Accuracy : 0.8814          
##                  95% CI : (0.8351, 0.9185)
##     No Information Rate : 0.5613          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7573          
##                                           
##  Mcnemar's Test P-Value : 0.2012          
##                                           
##             Sensitivity : 0.8288          
##             Specificity : 0.9225          
##          Pos Pred Value : 0.8932          
##          Neg Pred Value : 0.8733          
##              Prevalence : 0.4387          
##          Detection Rate : 0.3636          
##    Detection Prevalence : 0.4071          
##       Balanced Accuracy : 0.8757          
##                                           
##        'Positive' Class : Low             
## 

Model Evaluation Workshop Task: 4. \(R^2\), MSE, RMSE for Regression

# Caret train/test split method:
set.seed(88)
caret_idx = createDataPartition(ASD_State_4_MLR$Prevalence, p = 0.85, list = FALSE)
caret_trainset = ASD_State_4_MLR[caret_idx, ]
caret_testset  = ASD_State_4_MLR[-caret_idx, ]
# The density plots of train and test below have similar shapes, indicating a representative split.
par(mfrow=c(1, 2)) 
plot(density(caret_trainset$Prevalence))
plot(density(caret_testset$Prevalence))

par(mfrow=c(1, 1))

Build a regression model to predict (numeric) Prevalence using Multiple Linear Regression (MLR)

if(!require(elasticnet)){install.packages("elasticnet")}
## Loading required package: elasticnet
## Loading required package: lars
## Loaded lars 1.2
library("elasticnet")
# Caret MLR regression
set.seed(88)
cv_control = trainControl(method = "cv", number = 5)
#
caret_model_MLR <- train(Prevalence ~ ., data = caret_trainset, trControl = cv_control, method = "lm") 
#
caret_model_MLR
## Linear Regression 
## 
## 1440 samples
##    4 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1152, 1153, 1152, 1152, 1151 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   2.960949  0.7460988  1.990705
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=8)
ggplot(varImp(caret_model_MLR))

# plot(varImp(caret_model_MLR))
caret_model_MLR_Pred <- predict(caret_model_MLR, caret_testset)
head(caret_model_MLR_Pred)
##        22        27        33        36        41        55 
##  9.389012  7.831519  9.670434 10.609628 12.070592 13.402663
# MSE
mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
## [1] 5.837448
# RMSE
sqrt(mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2))
## [1] 2.416081

Calculate \(R^2\) for entire train set (all folds) https://stackoverflow.com/questions/25691127/r-squared-on-test-data

caret_model_MLR_Pred_Train <- predict(caret_model_MLR, caret_trainset)

SS.total      <- sum((caret_trainset$Prevalence-mean(caret_trainset$Prevalence))^2)
SS.residual   <- sum(residuals(caret_model_MLR)^2)
SS.regression <- sum((caret_model_MLR_Pred_Train-mean(caret_trainset$Prevalence))^2)

# fraction of variability explained by the model
cat("\nTrain R^2 fraction of variability explained by the model :", SS.regression/SS.total )
## 
## Train R^2 fraction of variability explained by the model : 0.7699826
# caret_model_MLR_L2

Calculate \(R^2\) for test set https://stackoverflow.com/questions/25691127/r-squared-on-test-data

# True y values:
# head(caret_testset$Prevalence)

# Predicted y values (y hat):
# head(caret_model_MLR_Pred)

SS.total      <- sum((caret_testset$Prevalence - mean(caret_testset$Prevalence))^2)
SS.residual   <- sum((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
SS.regression <- sum((caret_model_MLR_Pred - mean(caret_testset$Prevalence))^2)

# Conventional out-of-sample R^2: 1 - SS.residual/SS.total. On test data this
# generally differs from SS.regression/SS.total, because the decomposition
# SS.total = SS.regression + SS.residual holds exactly only for OLS on the
# training data.
test.rsq <- 1 - SS.residual/SS.total
# cat("\nTest R^2 (1 - SS.residual/SS.total) :", test.rsq )

# fraction of variability explained by the model
cat("\nTest R^2 fraction of variability explained by the model :", SS.regression/SS.total )
## 
## Test R^2 fraction of variability explained by the model : 0.8988214

Linear Model: Prevent Overfitting by Regularization Methods

<h3>
Linear Model: Prevent Overfitting by Regularization Methods - L1 Lasso Regression & L2 Ridge regression
</h3>

https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

  • L1 Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of each coefficient as a penalty term to the loss function.

  • L2 Ridge Regression adds the “squared magnitude” of each coefficient as a penalty term to the loss function.

The key difference between these techniques is that L1 Lasso can shrink the coefficients of less important features all the way to zero, removing those features altogether. It therefore works well for feature selection when there is a large number of features.
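As a minimal numeric illustration of the two penalty terms (using a made-up coefficient vector `beta` and penalty weight `lambda`, not values from the ASD model):

```r
beta   <- c(3, -0.5, 0, 1.2)   # hypothetical coefficients
lambda <- 0.1                  # hypothetical penalty weight

l1_penalty <- lambda * sum(abs(beta))  # Lasso: sum of absolute magnitudes
l2_penalty <- lambda * sum(beta^2)     # Ridge: sum of squared magnitudes

cat("L1 penalty:", l1_penalty, "\n")   # 0.47
cat("L2 penalty:", l2_penalty, "\n")   # 1.069
```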

https://towardsdatascience.com/create-predictive-models-in-r-with-caret-12baf9941236

Enhanced MLR using Regularization: L1 Lasso

if(!require(elasticnet)){install.packages("elasticnet")}
library("elasticnet")
# Caret MLR with Regularization
# possible methods: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV"
cv_control <- trainControl(method = "repeatedcv",   
                           number = 10,     # number of folds
                           repeats = 5)    # repeated N times

caret_model_MLR_L1 <- train(Prevalence ~ .,
               data = caret_trainset,
               method = "lasso",  # Try using "ridge"
               trControl = cv_control,
               preProcess = c('scale', 'center'))  # Auto pre-process data
#
caret_model_MLR_L1
## The lasso 
## 
## 1440 samples
##    4 predictors
## 
## Pre-processing: scaled (55), centered (55) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1296, 1295, 1296, 1296, 1296, 1297, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE      Rsquared   MAE     
##   0.1       4.594316  0.5964018  3.264456
##   0.5       3.158034  0.7178476  2.140953
##   0.9       2.944008  0.7526252  1.969675
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.9.
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=2)
ggplot(varImp(caret_model_MLR_L1))

# plot(varImp(caret_model_MLR_L1))
caret_model_MLR_Pred <- predict(caret_model_MLR_L1, caret_testset)
head(caret_model_MLR_Pred)
##        22        27        33        36        41        55 
##  9.095595  7.657583  9.711652 10.420214 11.766911 13.097625
# MSE
mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
## [1] 5.900344
# RMSE
sqrt(mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2))
## [1] 2.429062

Calculate \(R^2\) for entire train set (all folds) https://stackoverflow.com/questions/25691127/r-squared-on-test-data

caret_model_MLR_Pred_Train <- predict(caret_model_MLR_L1, caret_trainset)

SS.total      <- sum((caret_trainset$Prevalence-mean(caret_trainset$Prevalence))^2)
SS.residual   <- sum(residuals(caret_model_MLR_L1)^2)
SS.regression <- sum((caret_model_MLR_Pred_Train-mean(caret_trainset$Prevalence))^2)

# fraction of variability explained by the model
cat("\nTrain R^2 fraction of variability explained by the model :", SS.regression/SS.total )
## 
## Train R^2 fraction of variability explained by the model : 0.7498116
# caret_model_MLR_L2

Calculate \(R^2\) for test set https://stackoverflow.com/questions/25691127/r-squared-on-test-data

# True y values:
# head(caret_testset$Prevalence)

# Predicted y values (y hat):
# head(caret_model_MLR_Pred)

SS.total      <- sum((caret_testset$Prevalence - mean(caret_testset$Prevalence))^2)
SS.residual   <- sum((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
SS.regression <- sum((caret_model_MLR_Pred - mean(caret_testset$Prevalence))^2)

# Conventional out-of-sample R^2: 1 - SS.residual/SS.total. On test data this
# generally differs from SS.regression/SS.total, because the decomposition
# SS.total = SS.regression + SS.residual holds exactly only for OLS on the
# training data.
test.rsq <- 1 - SS.residual/SS.total
# cat("\nTest R^2 (1 - SS.residual/SS.total) :", test.rsq )

# fraction of variability explained by the model
cat("\nTest R^2 fraction of variability explained by the model :", SS.regression/SS.total )
## 
## Test R^2 fraction of variability explained by the model : 0.8798161
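Because lasso can zero out coefficients, it is worth inspecting which predictors survive at the selected shrinkage `fraction`. This sketch pulls the coefficients from the fitted `enet` object inside the caret model (assuming `caret_model_MLR_L1` from above is in scope):

```r
# Coefficients of the final lasso fit at the tuned shrinkage fraction;
# predictors shrunk exactly to zero have been dropped by the L1 penalty
predict(caret_model_MLR_L1$finalModel,
        type = "coefficients",
        s    = caret_model_MLR_L1$bestTune$fraction,
        mode = "fraction")$coefficients
```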

Enhanced MLR using Regularization: L2 Ridge

# Caret MLR with Regularization
# possible methods: "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV"
cv_control <- trainControl(method = "repeatedcv",   
                           number = 10,     # number of folds
                           repeats = 5)    # repeated N times

caret_model_MLR_L2 <- train(Prevalence ~ .,
               data = caret_trainset,
               method = "ridge",  # Try using "lasso"
               trControl = cv_control,
               preProcess = c('scale', 'center'))  # Auto pre-process data
#
caret_model_MLR_L2
## Ridge Regression 
## 
## 1440 samples
##    4 predictors
## 
## Pre-processing: scaled (55), centered (55) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1297, 1296, 1297, 1296, 1295, 1297, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0e+00   2.949273  0.7532727  1.971258
##   1e-04   2.949261  0.7532723  1.971315
##   1e-01   3.000592  0.7444626  2.054192
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=2)
ggplot(varImp(caret_model_MLR_L2))

# plot(varImp(caret_model_MLR_L2))
caret_model_MLR_Pred <- predict(caret_model_MLR_L2, caret_testset)
head(caret_model_MLR_Pred)
##        22        27        33        36        41        55 
##  9.384760  7.826531  9.667148 10.606323 12.066339 13.398437
# MSE
mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
## [1] 5.838206
# RMSE
sqrt(mean((caret_testset$Prevalence - caret_model_MLR_Pred)^2))
## [1] 2.416238

Calculate \(R^2\) for entire train set (all folds) https://stackoverflow.com/questions/25691127/r-squared-on-test-data

caret_model_MLR_Pred_Train <- predict(caret_model_MLR_L2, caret_trainset)

SS.total      <- sum((caret_trainset$Prevalence-mean(caret_trainset$Prevalence))^2)
SS.residual   <- sum(residuals(caret_model_MLR_L2)^2)
SS.regression <- sum((caret_model_MLR_Pred_Train-mean(caret_trainset$Prevalence))^2)

# fraction of variability explained by the model
cat("\nTrain R^2 fraction of variability explained by the model :", SS.regression/SS.total )
## 
## Train R^2 fraction of variability explained by the model : 0.7699436
# caret_model_MLR_L2

Calculate \(R^2\) for test set https://stackoverflow.com/questions/25691127/r-squared-on-test-data

# True y values:
# head(caret_testset$Prevalence)

# Predicted y values (y hat):
# head(caret_model_MLR_Pred)

SS.total      <- sum((caret_testset$Prevalence - mean(caret_testset$Prevalence))^2)
SS.residual   <- sum((caret_testset$Prevalence - caret_model_MLR_Pred)^2)
SS.regression <- sum((caret_model_MLR_Pred - mean(caret_testset$Prevalence))^2)

# Conventional out-of-sample R^2: 1 - SS.residual/SS.total. On test data this
# generally differs from SS.regression/SS.total, because the decomposition
# SS.total = SS.regression + SS.residual holds exactly only for OLS on the
# training data.
test.rsq <- 1 - SS.residual/SS.total
# cat("\nTest R^2 (1 - SS.residual/SS.total) :", test.rsq )

# fraction of variability explained by the model
cat("\nTest R^2 fraction of variability explained by the model :", SS.regression/SS.total )
## 
## Test R^2 fraction of variability explained by the model : 0.8987872
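Since the L1 and L2 models were tuned with the same repeated 10-fold cross-validation setup, caret's `resamples` helper can summarise their resampling metrics side by side. A sketch assuming both models above are in scope (models built with a different resampling scheme, such as the 5-fold `caret_model_MLR`, cannot be pooled this way):

```r
# Collect per-resample RMSE / Rsquared / MAE for both regularized models
reg_results <- resamples(list(lasso_L1 = caret_model_MLR_L1,
                              ridge_L2 = caret_model_MLR_L2))
summary(reg_results)
```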

Workshop Submission

<h3>
    What to submit?
</h3>
<p>
    Create a predictive model for multi-class classification of ASD Prevalence Risk Level (Low, Medium, High, Very High) using the caret package's multinomial logistic regression (multinom) algorithm.
</p>

References:

https://daviddalpiaz.github.io/r4sl/the-caret-package.html

http://topepo.github.io/caret/index.html

https://cran.r-project.org/web/packages/caret/caret.pdf

# Code example of multinomial logistic regression using the iris flower dataset
iris[c(1:3, 51:53, 101:103), ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
iris_model = train(Species ~ ., data = iris, method = "multinom", 
                     trControl = trainControl(method = "cv", number = 5), trace = FALSE)
iris_model
## Penalized Multinomial Regression 
## 
## 150 samples
##   4 predictors
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 120 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa
##   0e+00  0.9533333  0.93 
##   1e-04  0.9600000  0.94 
##   1e-01  0.9733333  0.96 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
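For the multi-class submission task, classification quality is usually reported with a confusion matrix rather than RMSE. As a sketch of that workflow on the iris model above (evaluated in-sample here for brevity; in your submission, predict on a held-out test set instead):

```r
# Per-class confusion matrix plus overall Accuracy and Kappa
iris_pred <- predict(iris_model, iris)
confusionMatrix(iris_pred, iris$Species)
```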
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.


Excellent! You have completed the workshop notebook!

Connect with the author:

This notebook was written by GU Zhan (Sam).

Sam is currently a lecturer at the Institute of Systems Science, National University of Singapore. He devotes himself to pedagogy & andragogy, and is passionate about inspiring the next generation of artificial intelligence lovers and leaders.

Copyright © 2020 GU Zhan

This notebook and its source code are released under the terms of the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Appendices

<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>

https://github.com/telescopeuser/S-SB-Workshop

<a href="https://github.com/dd-consulting">
     <img src="../reference/GZ_logo.png" width="60" align="right">
    https://github.com/dd-consulting
</a>